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METHODS AND COMPOSITIONS UTILIZING HYBRID EXACT ROTAMER OPTIMIZATION 
ALGORITHMS FOR PROTEIN DESIGN 

This application is a continuing application of U.S. S.N. 09/866,511, filed May 24, 2001. 

FIELD OF THE INVENTION 

The present invention relates to an apparatus and method for quantitative protein design and 
optimization. In particular, the invention describes the use of Hybrid Exact Rotamer Optimization 
algorithms in protein design. 

BACKGROUND OF THE INVENTION 

De novo protein design has received considerable attention recently, and significant advances have 
been made toward the goal of producing stable, well-folded proteins with novel sequences. Efforts to 
design proteins rely on knowledge of the physical properties that determine protein structure, such as 
the patterns of hydrophobic and hydrophilic residues in the sequence, salt bridges and hydrogen 
bonds, and secondary structural preferences of amino acids. Various approaches to apply these 
principles have been attempted. For example, the construction of a-helical and 3-sheet proteins with 
native-like sequences was attempted by individually selecting the residue required at every position in 
the target fold (Hecht, et a/.. Science 249:884-891 (1990); Quinn, etal., Proc. Natl. Acad. Sci USA 
91:8747-8751 (1994)). Alternatively, a minimalist approach was used to design helical proteins, where 
the simplest possible sequence believed to be consistent with the folded structure was generated 
(Regan, etal.. Science 241:976-978 (1988); DeGrado, ef a/., Science 243:622-628 (1989); Handel, et 
a/., Science 261:879-885 (1993)), with varying degrees of success. An experimental method that 
relies on the hydrophobic and polar (HP) pattern of a sequence was developed where a library of 
sequences with the correct pattern for a four helix bundle was generated by random mutagenesis 
(Kamtekar, etal., Science 262:1680-1685 (1993)). Among non de novo approaches, domains of 
naturally occurring proteins have been modified or coupled together to achieve a desired tertiary 
organization (Pessi, etal., Nature 362:367-369 (1993); Pomerantz, etal., Science 267:93-96 (1995)). 




Though the correct secondary structure and overall tertiary organization seem to have been attained 
by several of the above techniques, many designed proteins appear to lack the structural specificity of 
native proteins. The complementary geometric arrangement of amino acids in the folded protein is the 
root of this specificity and is encoded in the sequence. 

5 

Several groups have applied and experimentally tested systematic, quantitative methods to protein 
design with the goal of developing general design algorithms (Hellinga, ef a/., J. Mol. Biol. 222: 763- 
785 (1991); Hurley, etal., J. Mol. Biol. 224:1143-1154 (1992); Desjarlais, etal.. Protein Science 
4:2006-2018 (1995); Harbury, etal.. Proc. Natl. Acad. Sci. USA 92:8408-8412 (1995); Klemba, era/., 
10 Nat. Struc. Biol. 2:368-373 (1995); Nautiyal, ef a/., Biochemistry 34:1 1645-11651 (1995); Betzo, etal., 
Biochemistry 35:6955-6962 (1996); Dahiyat, etal., Protein Science 5:895-903 (1996); Jones, Protein 
Science 3:567-574 (1994); Konoi, et al., Proteins: Structure, Function and Genetics 19:244-255 
(1994); Dahiyat and Mayo, Science, 278:82 (1997); Dahiyat and Mayo, Prot. Sci. 5:895 (1996); Gordon 
and Mayo, J. Comput. Chem. 18:1505 (1998); Dahiyat, etal., Prot. Sci., 6:1333 (1997); Street and 
161 Mayo, Foding Des., 3:253 (1998); Gordon et al., Curr. Opin. Struct. Biol., 9:509 (1999)). These 

"t; algorithms consider the spatial positioning and steric complementarity of side chains by explicitly 
modeling the atoms of sequences under consideration. To date, such techniques have typically 

53 focused on designing the cores of proteins and have scored sequences with van der Waals and 

f . sometimes hydrophobic solvation potentials. 

20: 

f\ Recent studies using coiled coils have demonstrated that core side-chain packing can be combined 

y. with explicit backbone flexibility (Harbury et al., PNAS USA 92:8408-8412 (1995); Offer & Sessions, J. 

fU Mol. Biol. 249:967-987 (1 995). In these cases, the goal was to search for backbone coordinates that 

l,. satisfied a fixed amino acid sequence. 



In addition, the qualitative nature of many design approaches has hampered the development of 
improved, second generation, proteins because there are no objective methods for learning from past 
design successes and failures. 

30 Thus, it is an object of the invention to provide computational protein design and optimization via an 
objective, quantitative design technique implemented in connection with a general purpose computer. 



35 SUMMARY OF THE INVENTION 

In accordance with the objects outlined above, the present invention provides methods executed by a 
computer under the control of a program, the computer including a memory for storing the program. 
The methods comprise the steps of receiving a protein backbone structure with variable residue 
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positions, establishing a group of potential rotamers for each of the variable residue positions, wherein 
at least one variable residue position has rotamers from at least two different amino acid side chains, 
and analyzing the interaction of each of the rotamers with all or part of the remainder of the protein 
backbone structure to generate a set of optimized protein sequences. The methods further comprise 
5 classifying each variable residue position as either a core, surface or boundary residue. The analyzing 
step may include a Hybrid Exact Rotamer Optimization (HERO) computation either alone or in 
combination with a Branch and Terminate (B&T) computation. The HERO computational step may 
include the use of split singles Dead End Elimination(DEE) or split singles DEE with split flags. 

1 0 Generally, the analyzing step includes the use of at least one scoring function selected from the group 
consisting of a Van der Waals potential scoring function, a hydrogen bond potential scoring function, 
an atomic solvation scoring function, a secondary structure propensity scoring function and an 
electrostatic scoring function. Additionally, the scoring functions may be modified to emphasize 
"singles" energy (i.e., rotamer-backbone) terms. The methods further comprise altering the protein 

1 5 backbone prior to the analysis, comprising altering at least one supersecondary structure parameter 

value. The methods may further comprise generating a rank ordered list of additional optimal 
"t: sequences from the globally optimal protein sequence. Some or all of the protein sequences from the 
ordered list may be tested to produce potential energy test results. 

20 In an additional aspect, the invention provides nucleic acid sequences encoding a protein sequence 

generated by the present methods, and expression vectors and host cells containing the nucleic acids. 

In a further aspect, the invention provides a computer readable memory to direct a computer to 
function in a specified manner, comprising a side chain module to correlate a group of potential 
25 rotamers for residue positions of a protein backbone model, and a ranking module to analyze the 

interaction of each of said rotamers with all or part of the remainder of said protein to generate a set of 
optimized protein sequences. The memory may further comprise an assessment module to assess 
the correspondence between potential energy test results and theoretical potential energy data. 

30 BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates a general purpose computer configured in accordance with an embodiment of the 
invention. 

35 Figure 2 illustrates processing steps associated with an embodiment of the invention. 

Figure 3A illustrates processing steps associated with a ranking module used in accordance with an 
embodiment of the invention. After any DEE step, any one of the previous DEE steps may be 
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repeated. In addition, any one of the DEE steps may be eliminated; for example, original singles DEE 
need not be run. 



Figures 3B, 3C and 3D illustrate the processing steps which may comprise HERO. 

5 

Figure 4 is a schematic representation of the minimum and maximum quantities (defined in Equations 
24-27) that are used to construct speed enhancements. The minima and maxima are utilized directly 
to find the (iJJ mb pair and for the comparison of the extrema. The differences between the quantities, 
denoted with arrows, are used to construct the q rs and q uv metrics. 

10 

Figures 5A, 5B, 5C and 5D depict several super-secondary structure parameters for or/p proteins. 
The definitions are similar to those previously developed for oc/p proteins (Janin & Chothia, J Mol Biol 
143:95-128 (1980); Cohen et al., J Mol Biol 156:821-862 (1982)). The helix center is defined as the 
average C a position of the residues in the helix. The helix axis is defined as the principal moment of 
1§=, the C a atoms of the residues in the helix. (Chothia et al., Proc Natl Acad Sci USA 78:4146-4150 
•J: (1981); J Mol Biol 145:215-250 (1981). The strand axis is defined as the average of the least-squares 

lines fit through the midpoints of sequential C„ positions of two centra! P-strands. The sheet plane is 
~ defined as the least-squares plane fit through the C a positions of the residues of the sheet. The sheet 
axis is defined as the vector perpendicular to the sheet plane that passes through the helix center. Q 
20 is the angle between the strand axis and the helix axis after projection onto the sheet plane; 0 is the 
* angle between the helix axis and the sheet plane; h is the distance between the helix center and the 
f*' sheet plane; o is the rotation angle about the helix axis. The super-secondary structure parameter 
\ values for native GP1 are Q = -26.49°, 9 = 3.20°, h = 1 0.04 A and o = 0°. 

2& Figures 6A, 6B, 6C and 6D depict four supersecondary structure parameters for p/p protein 

interactions. Figures 6A and 6B are relevant to p barrel proteins; Figure 6C is relevant to P-sheet 
interactions. Figure 6A shows only three strands, and depicts R, the barrel radius; a, the tilt of the 
strands relative to the barrel axis; a, the distance from C a to C a along the strands; and b, the 
interstrand distance. Figure 6B shows the twist and coiling angles of the P-sheet, with residues A, B 

30 and C from one strand, residues D, E and F in strand 2, and residues G, H and I from strand 3. The 
circles represent the positions of the residues when projected onto the surface of the barrel. In this 
case, 0 is the mean twist of the p-sheet about an axis perpendicular to the strand direction, t is the 
mean twist of the P-sheet about an axis parallel to the strand direction, e is the mean coiling of the P- 
sheet along the strands, n is the mean coiling of the P-sheet along a line perpendicular to the strands. 

35 Figure 6C depicts two P-sheets, with the chain direction being shown with arrows. Figure 6D depicts 
two P-sheets of distance h with angle 6 between the average strand vectors. There is also <£, 
perpendicular to vectors defining 0. 
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Figures 7A, 7B, 7C and 7D depict four supersecondary structure parameters a/a supersecondary 
structure parameters for a/a interactions, d is the distance between the helices and 6 is the angle 
between the axes of the helices, o is defined as the rotation around the helix axis. Q is the angle 
between two strand axes after projection onto a plane. In Figures 7C and 7D, the dark circle 
5 represents a view of the helix from the end. 

Figure 8 illustrates the convergence of HERO compared to DEE (s = 2 mb ) for 37 core residues of a 
designed leucine rich repeat backbone. 

10 Figures 9A and 9B illustrate rotamer notation for dead-end elimination. The thick line represents the 
protein backbone, the filled circles indicate residue positions, and the thin lines emanating from each 
residue are side-chain rotamers. Figure 9A depicts original and Goldstein DEE. Figure 9B depicts 
simple split DEE (s = 1). 

1 5 Figures 1 0 A-1 0D illustrate different dead-end elimination criteria for sample energy profiles. The 
I abscissa represents all possible conformations of the protein and the ordinate describes the net 

energy contribution produced by interactions with specific rotamers at position /. Figure 10A depicts 
original DEE: i r is eliminated by / t1 but not by i a . Figure 10B depicts simple Goldstein DEE: i r is 
eliminated by either / t1 or i a . Figure 10C depicts general Goldstein DEE: \ r cannot be eliminated by 

20 either / t1 or i a , but can be eliminated by a weighted average of the two. Figure 1 0D depicts bottom line 
DEE: theoretically, \ r can be eliminated if the minimum of the i„ and i a profiles always falls below the i r 
profile. 

Figures 1 1 A - 1 1 D illustrate application of split DEE to sample energy profiles. 

25 

Figure 12 depicts the cost comparisons of different flagging approaches. 

Figure 13. Comparison of bounding and pairs energies during a bound flags iteration of the 
plastocyanin core calculation of Case 1. The reference energy obtained by a Monte Carlo calculation 
30 is shown as a horizontal line. All pairs with a bounding energy above the line may be flagged. 

Figures 14A - 14E depict DEE and HERO convergence results. Figure 14A illustrates Case 1 , full 
core design of plastocyanin. Figure 14B illustrates Case 2, full core design of novel repeating 
backbone. Figure 14C illustrates Case 3, full core design of the variable domains of the light and 
35 heavy chains of a catalytic antibody. Figure 14D illustrates Case 4, full core and boundary design of 
the (31 domain of Protein G. Figure 14E illustrates Case 5, full surface design of the (31 domain of 
Protein G. 
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Figure 15 illustrates convergence comparison for HERO on a full sequence design of the p1 domain of 
Protein G using the experimentally validated "standard" potential function and rotamer library, a 
modified potential function with the standard rotamer library and the standard potential function with a 
reduced rotamer library. 

5 

DETAILED DESCRIPTION OF THE INVENTION 

The present invention is directed to the quantitative design and optimization of amino acid sequences, 
using an "inverse protein folding" approach, which seeks the optimal sequence for a desired structure. 
10 Inverse folding is similar to protein design, which seeks to find a sequence or set of sequences that 
will fold into a desired structure. These approaches can be contrasted with a "protein folding" 
approach which attempts to predict a structure taken by a given sequence. 

The general preferred approach of the present invention is as follows, although alternate embodiments 
15 are discussed below. A known protein structure is used as the starting point. The residues to be 

optimized are then identified, which may be the entire sequence or subset(s) thereof. The side chains 
of any positions to be varied are then removed. The resulting structure consisting of the protein 
backbone and the remaining sidechains is called the template. Each variable residue position is then 
preferably classified as a core residue, a surface residue, or a boundary residue; each classification 
20 defines a subset of possible amino acid residues for the position (for example, core residues generally 
will be selected from the set of hydrophobic residues, surface residues generally will be selected from 
the hydrophilic residues, and boundary residues may be either). Each amino acid can be represented 
by a discrete set of all allowed conformers of each side chain, called rotamers. Thus, to arrive at an 
optimal sequence for a backbone, all possible sequences of rotamers must be screened, where each 
25 backbone position can be occupied either by each amino acid in all its possible rotameric states, or a 
subset of amino acids, and thus a subset of rotamers. 

Two sets of interactions are then calculated for each rotamer at every position: the interaction of the 
rotamer side chain with all or part of the backbone (the "singles" energy, also called the 

30 rotamer/template or rotamer/backbone energy), and the interaction of the rotamer side chain with all 

other possible rotamers at every other position or a subset of the other positions (the "doubles" energy, 
also called the rotamer/rotamer energy). The energy of each of these interactions is calculated 
through the use of a variety of scoring functions, which include the energy of van der Waal's forces, 
the energy of hydrogen bonding, the energy of secondary structure propensity, the energy of surface 

35 area solvation and the electrostatics. Thus, the total energy of each rotamer interaction, both with the 
backbone and other rotamers, is calculated, and stored in a matrix form. Thus, a sample matrix is 
generated for the singles calculation, and for the doubles calculations. 
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The discrete nature of rotamer sets allows a simple calculation of the number of rotamer sequences to 
be tested. A backbone of length n with m possible rotamers per position will have m n possible rotamer 
sequences, a number which grows exponentially with sequence length and renders the calculations 
either unwieldy or impossible in real time. Accordingly, to solve this combinatorial search problem, 

5 either a "Dead End Elimination" (DEE) calculation or a "Hybrid Exact Rotamer Optimization" (HERO) 

calculation, or both, are performed. The DEE calculation is based on the fact that if the worst total 
interaction of a first rotamer is still better than the best total interaction of a second rotamer, then the 
second rotamer cannot be part of the global optimum solution. Since the energies of all rotamers 
have already been calculated, the DEE approach only requires sums over the sequence length to test 
10 and eliminate rotamers, which speeds up the calculations considerably. DEE can be rerun comparing 
pairs of rotamers, or combinations of rotamers, which will eventually result in the determination of a 
single sequence which represents the global optimum energy. Alternatively, a HERO calculation can 
be done. 

1 fr In a preferred embodiment, a Hybrid Exact Rotamer Optimization (HERO) calculation is done to find 
the single sequence which represents the global optimum energy. Thus, the computational 
processing includes one or more HERO computational steps as outlined below. 

Accordingly, the present invention provides HERO, an exact algorithm for determining the global 
20- minimum energy conformation (GMEC) of a set of discrete side chain rotamers anchored to fixed 

protein backbone coordinates. For large design problems, HERO converges significantly faster than 
existing exact algorithms, and provides solutions for design cases that were previously intractable by 
any known exact search method. 

25 HERO employs both dominance criteria and bounding criteria to eliminate single rotamers, and flag 

pairs of rotamers that are incompatible with the GMEC conformation. The bounding criteria require a 
reliable cutoff energy from a valid conformation. This is obtained during the calculation using parallel 
Monte Carlo searches from the current state of the HERO conformational ensemble. 

30 The dominance criteria were originally developed for use in the Dead-End Elimination (DEE) algorithm 
and the bounding criteria were originally developed for use in Branch and Terminate (B & T) (Gordon, 
DB., and Mayo, SL, (1999) Structure, 7:1089-1097, hereby incorporated by reference in its entirety). 
While the method employs a stochastic Monte Carlo search to inform the bounding criteria, HERO 
remains an exact search method. The novelty of HERO is the simultaneous incorporation of three 

35 completely different search paradigms into a single compatible exact search algorithm. 

In a preferred embodiment, HERO comprises three search paradigms into a single exact search 
algorithm. The three search paradigms comprise dominance criteria, bounding criteria, and a 
stochastic search paradigm, wherein ail three search paradigms are run simultaneously. These three 
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search paradigms are facilitated by the process of unification, which is described below with particular 
reference to "super residues". 

Dominance criteria comprise Dead-end elimination algorithms. By "Dead-end elimination algorithms" 
5 herein is meant the original DEE theorem or a variation thereof (as outlined below). Thus, DEE 
algorithms which may be used include: original DEE, simple Goldstein DEE (7=1; Equation 36), 
general Goldstein DEE (7~>1; Equation 37), Goldstein DEE doubles (described below), magic bullet 
Goldsteins doubles (described below); bottom line DEE, simple split DEE (s = 1; Equation 38); split 
singles DEE (s =2; Equation 44); split singles DEE (s > 1; Equation 39); magic bullet splitting (also 
10 referred to herein as magic bullet metric; Equation 40); and, split flags (Equation 45). The dominance 
criteria may be used alone or in combination. 

Bounding criteria comprise singles bounding criteria (Equation 42) and doubles bounding criteria 
(Equation 43). 

15; 

~ Preferably, the stochastic search paradigm comprises a Monte Carlo search. 

Accordingly, the present invention describes a new version of singles DEE that effectively splits the 
conformational space into partitions. A rotamer at a given position can then be eliminated if, within 
20 each of the partitions, at least one of the other candidate rotamers at that location always produces 
lower paii-wise interaction energies. A hierarchy of splittings is defined, the simplest of which remains 
0(n 3 p 2 ), while substantially increasing the number of rotamers that are identified as dead-ending. 
Here p is the number of residue positions and n is the average number of rotamers per position. 

25 Using a potential function described in terms of pairwise interactions, the total energy of a protein can 
be evaluated with: 

Equation 34 

E total = E template+lL E ( i r )+S H E 0M 
i i j,j<i 

Here, E template represents the self-energy of the back-bone and E(i r ) represents the energy of rotamer r 
30 at position /', including both the self-energy and the interaction energy with the backbone. The term 

EUn j u ) represents the interaction energy between rotamers r and u at positions / and j, respectively. It 
is precisely the calculation of this total energy for each possible conformation that DEE seeks to avoid. 

The original dead-end elimination criterion states that a rotamer, / r , can be eliminated if an alternative 
35 rotamer, /„ at the same position (see Fig. 1 0A) satisfies: 



Equation 35 



E{i r )+ X vomE{i r j u )>E{i t ')+ J] maxE(i t ,j u ) 



This condition implies that i r can be eliminated if the net energy contribution resulting from its best 
case pairwise interactions with rotamers at all other positions (spanned by j u ) is still worse than that 
produced by the worst case pairwise interactions of some other candidate rotamer, /„ at the same 
position. To help in visualizing the significance of this criterion, Figure 10A depicts some relevant 
energy landscapes. Here, the abscissa represents all possible conformations of the protein and the 
ordinate describes the net energy contribution produced by interactions with specific rotamers at 
position /'. It is important to note that these energy profiles are not actually continuous and are instead 
composed of discrete points displayed in some arbitrary ordering of conformational space. The left- 
hand side of eq. (35) identifies the energy corresponding to the best case conformation A r tor rotamer 
/ r , and the right-hand side identifies the energies for worst case conformations /4 t1 and A a for candidate 
rotamers / t1 and / t2 , respectively. Hence, in the present scenario, rotamer / r could be eliminated by 
rotamer / t1 but not by i a . 

Goldstein (Goldstein, RF. (1994) J. Prot. Eng., 66:1335) improved on this idea using the more 
powerful criterion shown in Equation 36 ("Simple Goldstein DEE (7= 1)): 
Equation 36 



which states that rotamer i r can be eliminated if the contribution to the total energy is always reduced 
by using an alternative rotamer, In Figure 10B, the criterion measures the minimum difference 
between the profile for / r and the profiles using other candidate rotamers. In the present example, 
rotamers ; t1 and i G are both able to eliminate ;' r , because the differences at the points of closest 
approach (6 t1 and B a , respectively) are positive in both cases. This increased elimination power is a 
tremendous advantage in reducing the combinatorial size of the problem prior to resorting to doubles 
calculations. However, the criterion is still unable to eliminate rotamer i r if all of the candidate / ( 
rotamers have energy profiles that cross the i r profile for at least one conformation. 

To attempt to cope with this more challenging scenario, Goldstein proposed a more general criterion 




(T>1): 



Equation 37 




which uses a weighted average of 7 candidate rotamers to attempt to eliminate i r If some average 
of candidates is always better than the energy produced by i n it then follows that there is always at 



least one candidate i t that provides a better alternative to i n regardless of which conformation turns out 
to be the GMEC. Figure 10C illustrates a case for which \ r can be eliminated by the average of \ n and 
i a with constants C t1 = C a = Yz. One complication with this approach is that the choice of constants is 
not obvious, although Lastersetal. (Lasters, I., etal. (1997), J. Prat. Chem., 16:449-452) 
5 demonstrated that linear programming can be used to efficiently select suitable weights C t . However, 
even if the optimal weights can be identified, this criterion is still more conservative than the most 
general idea underlying DEE. In a case in which none of the candidate / ( rotamers have lower 
energies than / r for all conformations (and hence cannot eliminate / r by themselves), then they will 
necessarily be raising the average of the hybrid somewhere in conformation space. 

10 

Theoretically, it is unnecessary for the same / ( rotamer to eliminate / r for all regions of conformational 
space. Instead rotamer ; r can be eliminated if at least one candidate rotamer produces a lower energy 
for each possible conformation. As illustrated in Figure 10D, rotamer i r could thus be eliminated if the 
"bottom line" (Desmet, J., et al. In Altman, RB, et al (eds) Pacific Symposium on Biocomputing 1997, 
15 world Scientific: Singapore, 1997, p. 122) taken as the minimum energy of all the other possible 

alternative /, candidates was always less than that produced using i r Although the bottom line criterion 
■=- is theoretically the most powerful DEE elimination criterion, it is not apparent how to implement the 
approach. 

20 Alternatively, it is possible to split conformational space into partitions, within which each of the 

candidate i t rotamers can be compared singly with i r . If at least one /, rotamer produces a lower energy 
for each partition, then i r can be eliminated even if no single /, satisfies the simple Goldstein criterion 
for all partitions. Hence, for suitably defined partitions, it would be possible for rotamers /„ and i t2 of 
Figure 10D to jointly eliminate rotamer /„ even though neither of them could accomplish the feat alone. 

25 

The partitioning approach adopted herein is to split the conformational space into O(n) equally sized 
partitions using the rotamers at some position k * i. 

In a preferred embodiment, simple split DEE (s =1) is used to split conformation space into partitions. 
30 The criterion for simple split DEE (s = 1) is shown in Equation 38: 

Equation 38 

E(i r )- E(i t )+ £ { min[£( / r y„ )- E(i t j u )]} + [E(i r k v )- E{i,k v )]> 0 

35 stating that i r can be eliminated if, for each splitting rotamer v, at some splitting position k * /, there 

exists an i t rotamer that yields a lower net energy contribution for all conformations within that partition. 
Domination in partition k v is automatic if (i p k v ) is a flagged pair. The splitting position k is thus removed 
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from the summation in simple Goldstein DEE so that the relative merits of i r and the various /, 
candidates can be evaluated for each of the splitting rotamers /^corresponding to individual partitions 
of the conformational space. The split DEE criterion is illustrated in Figure 11 A. Figure 10E illustrates 
a case in which i r is successfully eliminated by splitting the conformation space into two partitions 
corresponding to the splitting rotamers /c v1 and k^. 

In a preferred embodiment, the number of splitting positions is increased. Increasing the number of 
splitting positions increase both the elimination power and the computational complexity. For 
example, for two splitting positions (s = 2), there are 0(n 2 ) partitions and i r may be eliminated if, for 
each pair of splitting rotamers k v and h w at splitting positions k*h*i, there exists an i f rotamer that 
dominates i r in that partition: 

Equation 44 



E(i r )-E(i,) + X \rnm[E(i r JJ-E(i,J u )]i 
+ [E(i r ,k v ) - E(i t ,k v )]+ [E(i r ,h w ) - E(i, ,h w )] > 0 

Domination follows automatically if either (i n k v ) or (i„h„) is a flagged pair. The application of this 
criterion is illustrated in Figure 1 1B, where the rotamers at the second splitting position (h w ) effectively 
create sub-partitions of those created by the first position {k v ). The computational complexity for (s=2) 
split DEE is 0(/? 4 p 3 )- 

In a preferred embodiment, "split singles" DEE (s > 1) is used to expand a larger fraction of the 
combinatorial space by using s splitting locations simultaneously, corresponding to 0{n s ) partitions. 
Split singles DEE (s > 1) criterion is defined: eliminate i r if for all unique combinations of v at some: 

k l ...k i * i 

there exists an / ( such that Equation 39 holds: 

Equation 39 

E(i r ) - £(/,) + I mm [ E (iJJ - E(ijj] + ^E[ £ 0'A) - E{i,k v )] >0 

This criterion may be applied for any s in the range 2, 3, 4, . . ., p-1 , where p is the total number of 
residue positions in the design. 
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As the number of splitting positions and partitions increase, so does the cost per iteration. 

Although split DEE criteria may be extended to flag pairs of rotamers exactly as for the Goldstein 
doubles criterion (as described below), flags (i.e., split flags) also can be generated during split singles 
5 calculations with no increase in computational complexity (See Figure 12). This concept is illustrated 
in Figures 11A -11D. In Figures 11A-1 1D, the abscissa represents all possible conformations of the 
protein and the ordinate represents the net energy contributions produced by interactions with specific 
rotamers at position i r 

10 Figure 1 1A depicts split DEE elimination (s=1). i r is dominated by i t1 and i t2 in the partitions 

corresponding to splitting rotamers k v1 and k v2 , respectively. Hence, i r may be eliminated even though 
it is not dominated by any single rotamer for all of conformational space. Figure 1 1C illustrates a 
scenario in which i r cannot be eliminated by split DEE (s=1) because there are some partitions in 
which no i, rotamer dominates i r In this case, it is still possible to identify dead-ending pairs during the 
is process of discovering this negative result. In those partitions k v where i r is dominated by some i t , the 
% rotamer pairs {i r ,k v ) may be flagged as dead-ending. This concept is illustrated in Figure 11 C. The 
• * comparisons that are made in an effort to identify flags remain a subset of those made for a full 

Goldstein doubles calculation. 0(h) dominance comparisons are made in attempting to flag each 
if rotamer pair. The complexity of split DEE is unaffected by this modification, remaining 0(/? 3 p 2 ) for 
i20 (s=1). 

M, Figure 1 1 B depicts split elimination (s=2). As neither i„ or i a dominates i r in partition k v2 , a second 

splitting position is used to create sub-partitions h w1 and h w2 , where i r is dominated by /„ and i a , 
— =: respectively. Thus, i r may be eliminated using two splitting positions. Figure 1 1C depicts split flagging 

=25 (s=1). I r is not dominated in partition /f„ 2 so elimination is not possible with only one splitting position. 
However, i r is dominated by i t1 in partition k v1 , so that pair (/ r ,/c w ) may be flagged. Figure 1 1 D depicts 
split flagging (s=2). i r is no longer dominated for all of conformational space so it cannot be eliminated 
with only two splitting positions. However, i r is dominated for partition k^ by i t1 and i t2 in sub-partitions 
h w1 and h w2 , respectively. Hence, the pair (i n k v2 ) may be flagged. Likewise, the pair (i n h w2 ) may be 
30 flagged, as becomes more readily apparent if the hierarchy of the splitting positions k and h is 
reversed. 

In a preferred embodiment, split singles DEE with split flags are generated with arbitrary numbers of 
splitting positions. This concept is illustrated for split DEE (s=2) in Figure 1 1D. In split flags, the 
35 number of flagging comparisons remains O(n) per rotamer pair but the iteration complexity increases 
to 0(n 4 p 3 )- For a given maximum number of splitting positions, the attempt to eliminate a rotamer i r 
fails as soon as a sub-partition at the lowest level is encountered in which i r is not dominated by some 
competitor. See also Looger, L. L, and Hellinga, H.W. (2001) J. Mol. Biol., 307:429-445 for the 
special case of s=1. 
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In a preferred embodiment, split singles DEE flags with split flags (s=1) is used until no further 
eliminations are possible. 

In a preferred embodiment, split singles DEE with split flags (s>2) is run once for each rotamer with or 
5 without magic bullet metric (equation 40). 

The singles criteria described above form the heart of the DEE approach. However, there are a 
number of refinements based on this foundation that substantially increase the performance of the 
algorithm. The most important of these is the use of doubles criteria to flag dead ending pairs for 
10 more efficient elimination using the singles criteria . Other refinements include a magic bullet doubles 
version of Goldstein criterion outlined below (see also Gordon, DB., and Mayo, SL. (1998) J. Comput. 
Chem., 19:1505; incorporated herein in its entirety). Magic bullet splitting may also be used to improve 
the performance of split singles DEE. 

1 5 When considering the use of split singles DEE, it is advantageous to keep the complexity well below 
0(n 5 p 3 ), so as to keep the cost of singles elimination a small fraction of the more expensive doubles 
process. The complexity estimate for split singles (Equation 39) reveals that the cost grows rapidly as 
s is increased. For s = 1 , the cost is 0(n 3 p 2 ), which is identical to the estimate for simple Goldstein 
singles. To reduce the cost for s > 1 , it is desirable to use only the one "magic bullet" splitting /q . . . k s 

20 that appears most likely to eliminate rotamer i r \ in this case, the cost of magic bullet split DEE reduces 
to 0(n 2+s p). 

In a preferred embodiment, magic bullet split singles DEE is used. Intuitively, the best splitting 
locations for magic bullet split single DEE should be those that have strong interactions with the 
25 rotamers at /. Accordingly, the objective is to find splitting positions /c, . . . k s , such that the candidate i, 
rotamers have widely varying interaction energies with rotamers at these positions /q . . . k s . 
Assuming no single i t rotamer is able to eliminate i n the splitting will then enable other candidate /, 
rotamers to contribute to the elimination of i r . For each i n the following "magic bullet metric" can be 
used to rank the positions /(,... k s with which the /, rotamers have the strongest adverse interactions: 

30 

Equation 40 

min min min ^E{i r k v ) - E(i t k v )] 

For s = 2 magic bullet splitting (s = 2 mb ), rotamers at the top two ranked positions are then used to split 
the conformational space, corresponding to a complexity of 0(n 4 p). Using magic bullet splitting pairs 
35 is thus only a factor of 0(n/p) more expensive than using s = 1 splitting performed at all positions. For 
s = 3, magic bullet splitting triplets (s = 3 mb )with a cost bound of 0{n 5 p) are also less expensive than 
Goldstein doubles. 
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In addition to the DEE dominance criteria described above, HERO also uses bounding criteria to 
eliminate rotamers and flag rotamer pairs that are incompatible with the GMEC. The bounding criteria 
for single rotamers and pairs of rotamers are special cases of the criteria used for B & T. The 
bounding criteria are summarized in Equations 41 - 43, using the following notation: 

5 

Equation 41 

F (j , . E{i r ) + E{j u ) E{i r J u ) 
E pair O r ,jJ- 2p _ 2 +— r ~, 

where p is the number of positions in the design. The energy bound for a single rotamer, referred to 
10 herein as "singles bounding criteria" is shown in Equation 42: 

Equation 42 




15 and /; can be eliminated if E bound (/ r ) > E low , where E, ow is the lowest known energy of a valid 

conformation determined by the Monte Carlo searches. The energy bound for a pair of rotamers, 
: referred to herein as "doubles bounding criteria" is shown in Equation 43: 

Equation 43 

20 

E bound(. i rJu> 2E pair(. i rJu) + X min \ 2E pai r ( i r^t)+ 2E pai r (Ju k t) + X min [^V Ov A )l \ , 
k*i*j ' { m*k*j*i v 1 \ 

and pair {i r , j„) can be flagged if E bound (/„ j u ) > E iow . 

HERO is parallelized to run across any number of processors. If HERO is run on N processors, then N 
25 independent Monte Carlo searches are performed during the stochastic search phase of the algorithm. 

Regardless of the DEE or bounding criterion used, it is generally beneficial to apply the condition 
iteratively, as previous eliminations often facilitate the elimination of further rotamers. Eventually, no 
further rotamers will be eliminated at any position by additional rounds of DEE or bounding. At this 
30 point, it is necessary to resort to a doubles calculation to flag dead-ending pairs, which can then be 

used to increase the effectiveness of the singles elimination criterion. "Dead Ending Pairs" are pairs of 
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rotamers that have been identified as being incompatible with GMEC. After further applications of a 
singles elimination criterion, it again becomes impossible to eliminate further rotamers. At this point, 
the rotamers at two positions are "unified" to form a superresidue that is treated as a single position for 
the remainder of the calculation. This process permanently expands a fraction of the combinatorial 
5 space and sets off a new cascade of singles eliminations. Furthermore, the two unified positions are 
chosen to be those with the highest fraction of dead-ending pairs. These pairs become dead-ending 
super-rotamers of the new superresidue and can thus be eliminated at the time of unification. 

In a preferred embodiment, HERO is used as the analyzing step to generate a set of optimized protein 
10 sequences. By "HERO" herein is meant an algorithm used to identify a single rotamer at each position 
that belongs to the global minimum energy conformation (GMEC). 

In a preferred embodiment, HERO is run using split singles DEE. In this embodiment, HERO includes 
the following steps: 1) Goldstein singles DEE (7= 1), repeated until no further eliminations can be 

15- made; 2) simple split DEE ( s = 1), repeated until no further eliminations can be made; 3) split singles 
DEE (s > 1) with or without magic bullet metric, once for each rotamer; 4) application of singles 
bounding criteria to eliminate rotamers whose bound energy E bound is greater than E low , the lowest 
known energy of a valid conformation obtained from the Monte Carlo searches; 5) alternating 
sequentially between one of the following, applying one during each cycle: a) magic bullet Goldstein 

20 doubles calculation to flag dead ending pairs; b) Monte Carlo search to find a low energy of a valid 

conformation E, ow , c) applying doubles bounding criteria to flag pairs whose bounding energy E bound is 
greater than E low ; d) Goldstein DEE doubles calculation to flag dead ending pairs; and, e) unification to 
identify super residues (see Figure 3B); and 6) repeating steps 1-5 until the GMEC is found. 

25 In a preferred embodiment, HERO is run using split singles DEE with split flags. In this embodiment, 
HERO includes the following steps: 1) Goldstein singles DEE, repeated until no further eliminations 
can be made; 2) split singles DEE with split flags (s=1), repeated until no further eliminations can be 
made; 3) split singles DEE with split flags (s>2) once for each rotamer, with or without magic bullet 
metric; 4) application of singles bounding criteria to eliminate rotamers whose bound energy E bomd is 

30 greater than E low , the lowest known energy of a valid conformation obtained from the Monte Carlo 

searches; 5) alternating sequentially between one of the following, applying one during each cycle: a) 
magic bullet Goldstein doubles calculation to flag dead ending pairs; b) Monte Carlo search to find a 
low energy of a valid conformation E low , c) applying doubles bounding criteria to flag pairs whose 
bounding energy E bound is greater than E l0w ; d) Goldstein DEE doubles calculation to flag dead ending 

35 pairs; and, e) unification to identify super residues (see Figure 3B); and 6) repeating steps 1-5 until the 
GMEC is found. 

Once the global solution has been found, a Monte Carlo search may be done to generate a rank- 
ordered list of sequences in the neighborhood of the DEE or HERO solution. Starting at the DEE or 
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HERO solution, random positions are changed to other rotamers, and the new sequence energy is 
calculated. If the new sequence meets the criteria for acceptance, it is used as a starting point for 
another jump. After a predetermined number of jumps, a rank-ordered list of sequences is generated. 
The results may then be experimentally verified by physically generating one or more of the protein 
5 sequences followed by experimental testing. The information obtained from the testing can then be 
fed back into the analysis, to modify the procedure if necessary. 

Thus, the present invention provides a computer-assisted method of designing a protein. The method 
comprises providing a protein backbone structure with variable residue positions, and then 
10 establishing a group of potential rotamers for each of the residue positions. As used herein, the 
backbone, or template, includes the backbone atoms and any fixed side chains. The interactions 
between the protein backbone and the potential rotamers, and between pairs of the potential rotamers, 
are then processed to generate a set of optimized protein sequences, preferably a single global 
optimum, which then may be used to generate other related sequences. 

15 

Figure 1 illustrates an automated protein design apparatus 20 in accordance with an embodiment of 
the invention. The apparatus 20 includes a central processing unit 22 which communicates with a 
f memory 24 and a set of input/output devices (e.g., keyboard, mouse, monitor, printer, etc.) 26 through 

a bus 28. The general interaction between a central processing unit 22, a memory 24, input/output 
20 devices 26, and a bus 28 is known in the art. The present invention is directed toward the automated 
protein design program 30 stored in the memory 24. 

The automated protein design program 30 may be implemented with a side chain module 32. As 
b • discussed in detail below, the side chain module establishes a group of potential rotamers for a 

2-5 selected protein backbone structure. The protein design program 30 may also be implemented with a 
ranking module 34. As discussed in detail below, the ranking module 34 analyzes the interaction of 
rotamers with the protein backbone structure to generate optimized protein sequences. The protein 
design program 30 may also include a search module 36 to execute a search, for example a Monte 
Carlo search as described below, in relation to the optimized protein sequences. Finally, an 

30 assessment module 38 may also be used to assess physical parameters associated with the derived 
proteins, as discussed further below. 

The memory 24 also stores a protein backbone structure 40, which is downloaded by a user through 
the input/output devices 26. The memory 24 also stores information on potential rotamers derived by 
35 the side chain module 32. In addition, the memory 24 stores protein sequences 44 generated by the 
ranking module 34. The protein sequences 44 may be passed as output to the input/output devices 
26. 



The operation of the automated protein design apparatus 20 is more fully appreciated with reference 
to Fig. 2. Fig. 2 illustrates processing steps executed in accordance with the method of the invention. 
As described below, many of the processing steps are executed by the protein design program 30. 
The first processing step illustrated in Fig. 2 is to provide a protein backbone structure (step 50). As 
5 previously indicated, the protein backbone structure is downloaded through the input/output devices 26 
using standard techniques. 

The protein backbone structure corresponds to a selected protein. By "protein" herein is meant at 
least two amino acids linked together by a peptide bond. As used herein, protein includes proteins, 

10 oligopeptides and peptides. The peptidyl group may comprise naturally occurring amino acids and 

peptide bonds, or synthetic peptidomimetic structures, i.e. "analogs", such as peptoids (see Simon ef 
a/., PNAS USA 89(20):9367 (1992)).. The amino acids may either be naturally occuring or non- 
naturally occuring; as will be appreciated by those in the art, any structure for which a set of rotamers 
is known or can be generated can be used as an amino acid. The side chains may be in either the (R) 

1f=; or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or (L) 
yCf configuration. 

- The chosen protein may be any protein for which a three dimensional structure is known or can be 
y generated; that is, for which there are three dimensional coordinates for each atom of the protein. 
20 Generally this can be determined using X-ray crystallographic techniques, NMR techniques, de novo 
* modelling, homology modelling, etc. In general, if X-ray structures are used, structures at 2A 
f s " resolution or better are preferred, but not required. 

H The proteins may be from any organism, including prokaryotes and eukaryotes, with enzymes from 
2S bacteria, fungi, extremeophiles such as the archebacteria, insects, fish, animals (particularly mammals 
and particularly human) and birds all possible. 

Suitable proteins include, but are not limited to, industrial and pharmaceutical proteins, including 
ligands, cell surface receptors, antigens, antibodies, cytokines, hormones, and enzymes. Suitable 
30 classes of enzymes include, but are not limited to, hydrolases such as proteases, carbohydrases, 
lipases; isomerases such as racemases, epimerases, tautomerases, or mutases; transferases, 
kinases, oxidoreductases, and phophatases. Suitable enzymes are listed in the Swiss-Prot enzyme 
database. 

35 Suitable protein backbones include, but are not limited to, all of those found in the protein data base 
compiled and serviced by the Brookhaven National Lab. 

Specifically included within "protein" are fragments and domains of known proteins, including 
functional domains such as enzymatic domains, binding domains, etc., and smaller fragments, such 
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as turns, loops, etc. That is, portions of proteins may be used as well. In addition, protein variants, i.e. 
non-naturally occurring variants, may be used. 

Once the protein is chosen, the protein backbone structure is input into the computer. By "protein 
5 backbone structure" or grammatical equivalents herein is meant the three dimensional coordinates 
that define the three dimensional structure of a particular protein. The structures which comprise a 
protein backbone structure (of a naturally occurring protein) are the nitrogen, the carbonyl carbon, the 
a -carbon, and the carbonyl oxygen, along with the direction of the vector from the oc-carbon to the (3- 
carbon. 

10 

The protein backbone structure which is input into the computer can either include the coordinates for 
both the backbone and the amino acid side chains, or just the backbone, i.e. with the coordinates for 
the amino acid side chains removed. If the former is done, the side chain atoms of each amino acid of 
the protein structure may be "stripped" or removed from the structure of a protein, as is known in the 
15 art, leaving only the coordinates for the "backbone" atoms (the nitrogen, carbonyl carbon and oxygen, 
and the cx-carbon, and the hydrogens attached to the nitrogen and a-carbon). 

In a preferred embodiment, the protein backbone structure is altered prior to the analysis outlined 
below. In this embodiment, the representation of the starting protein backbone structure is reduced to 

20 a description of the spatial arrangement of its secondary structural elements. The relative positions of 
the secondary structural elements are defined by a set of parameters called supersecondary structure 
parameters. These parameters are assigned values that can be systematically or randomly varied to 
alter the arrangement of the secondary structure elements to introduce explicit backbone flexibility. 
The atomic coordinates of the backbone are then changed to reflect the altered supersecondary 

25 structural parameters, and these new coordinates are input into the system for use in the subsequent 
protein design automation. 

Basically, a protein is first parsed into a collection of secondary structural elements which are then 
abstracted into geometrical objects. For example, as more fully outlined below, an oc-helix is 

30 represented by its helical axis and geometric center. The relative orientation and distance between 

these objects are summarized as super-secondary structure parameters. Concerted backbone motion 
can be introduced by simply modulating a protein's super-secondary structure parameter values. 
Accordingly, when all or part of the backbone is to be altered, the portion to be altered is classified as 
belonging to a particular supersecondary structure element, i.e. a/(3, a/a or p/(3, and then the 

35 supersecondary structural elements as outlined below are altered. As will be appreciated by those in 
the art, these elements need not be covalently linked, i.e. part of the same protein; for example, this 
can be done to evaluate protein-protein interactions. 
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As will be appreciated by those in the art, it is possible to alter the backbone of certain positions, while 
retaining either a particular amino acid (which can be "floated", as outlined below) or a particular 
rotamer at the position; alternatively, both the backbone can be moved and the amino acid side chain 
can be optimized as outlined herein. Similarly, the backbone can be held constant and only the amino 
5 acid side chains are optimized. Combinations of any of these at any position may be done. In 

general, when supersecondary structural parameters are altered, this is done on more than one amino 
acid, i.e. the backbone atoms of a plurality of amino acids that contribute to the secondary structure 
are moved. 

1 0 As will be appreciated by those in the art, there are a wide variety of different supersecondary structure 
parameters that can be used. Super-secondary structure parameterization has been described for 
fold classes that include a/a (Crick FHC "The Fourier transform of a coiled-coil." Acta Crystallogr 
6:685-689 (1953a); Crick FHC. "The packing of cr-helices." Acta Crystallogr 6:689-697 (1953b); 
Chothia et al., Proc Natl Acad Sci USA 78:4146^150 (1981) "Relative orientation of close-packed 0- 

15 pleated sheets in proteins"; Chothia et al., J Mol Biol 1 45:215-250 (1981) "Helix to helix packing in 
proteins"; Chou, et al. "Energetics of the structure of the four-cr-helix bundle in proteins." Proc Natl 
Acad Sci USA 85:4295-4299 (1988); Murzin AG, Finkelstein AV. "General architecture of the cr- 
helical globule." J Mol Biol 204:749-769 (1988); Presnell SR, Cohen FE. "Topological distribution of 
four-cr-helix bundles." Proc Natl Acad Sci USA 86:6592-6596 (1989); Harris et al. "Four helix bundle 

20 diversity in globular proteins." J Mol Biol 236:1356-1368 (1994)); a/0 (Chothia et al., "Structure of 
proteins: packing of cr-helices and pleated sheets." Proc Natl Acad Sci USA 74:4130-4134 (1977); 
Janin & Chothia, 1980 "Packing of cr-helices onto ^-pleated sheets and the anatomy of al{5 proteins. 
"J Mol Biol 743:95-1 28; Cohen et al., 1982, "Analysis and prediction of the packing of cr-helices 
against a /3-sheet in the tertiary structure of globular proteins." J Mol Biol 1 56:821-862; Chou et al., 

25 1985, "Interactions between an cr-heiix and /3-sheet energetics of alfi packing in proteins.) J Mol Biol 
786:591-609); and (3/p (Cohen etal., "Analysis and prediction of protein /?-sheet structures by a 
combinatorial approach." Nature 285:378-382 (1980); Cohen et al., "Analysis of the tertiary structure 
of protein yS-sheet sandwiches." J Mol Biol 748:253-272 (1981); Chothia & Janin, "Relative orientation 
of close-packed ^-pleated sheets in proteins." Proc Natl Acad Sci USA 78:4146-4150 (1981); Chothia 

30 & Janin, Proc Natl Acad Sci USA 78:3955-3965 (1982) "Orthogonal packing of ^-pleated sheets in 

proteins"; Chou et al., J Mol Biol 788:641-649 (1986) "Interactions between two /^-sheets energetics of 

packing in proteins"; Laster et al., Proc Natl Acad Sci USA 85:3338-3342 (1988)"Structure 
principles of parallel /^-barrels in proteins"; Murzin etal., J Mol Biol 236: 1 369-1 381 (1994a), "Principles 
determining the structure of /2-sheet barrels. I. A theoretical analysis"; Murzin et al. J Mol Biol 

35 236:1382-1400 (1994b) "Principles determining the structure of /3-sheet barrels. II. The observed 
structures"). All of these references are explicitly incorporated by reference herein in their entirety. 

Four different supersecondary structure parameters useful for a/(3 proteins are shown in Figure 5. In 
a preferred embodiment, as for all the supersecondary structure parameters, at least one of these 
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parameter values is altered; other embodiments utilize simultaneous or sequential alteration of two, 
three or four of these parameter values. 

For the a/p protein interactions, the helix center is defined as the average C a position of the residues 
5 chosen for backbone movement. The helix axis is defined as the principal moment of the C a atoms of 
these residues (see Chothia et al., 1981, supra). The strand axis is defined as the average of the 
least-squares lines fit through the midpoints of sequential C a positions of the two central (3-strands. 
The sheet plane is defined as the least-squares plane fit through the C a positions of the two central (3- 
strands. The sheet axis is defined as the vector perpendicular to the sheet plane that passes through 

1 0 the helix center. Q is the angle between the strand axis and the helix axis after projection onto the 
sheet plane; 6 is the angle between the helix axis and the sheet plane; h is the distance between the 
helix center and the sheet plane; o is the rotation angle about the helix axis. Backbone alteration 
requires altering at least one of these parameter values. In a preferred embodiment, the 
supersecondary structure parameter value Q is altered by changing the angle degree (either positively 

15 or negatively) of up to about 25 degrees, with changes of + 1°, 2.5°, 5°, 7.5°, and 10° being particularly 
- preferred. In a preferred embodiment, the supersecondary structure parameter value 6 is altered by 
changing the angle degree (either positively or negatively) of up to about 25 degrees, with changes of 
+ 1°, 2.5°, 5°, 7.5°, and 10° being particularly preferred, the supersecondary structure parameter 
value a is altered by changing the angle degree (either positively or negatively) of up to about 25 

20- degrees, with changes of + 1°, 2.5°, 5°, 7.5°, and 10° being particularly preferred. In a preferred 

embodiment, the supersecondary structure parameter value h is altered by changes (either positive or 
negative) of up to about 8 A, with changes of + 0.25, 0.50, 0.75, 1.00, 1.25 and 1.5 being particularly 
preferred. However, as will be appreciated by those in the art, as for all the parameter values outlined 
herein, larger changes can be made, depending on the protein (i.e. how close or far other secondary 

25 structure elements are) and whether other parameter values are made; for example, larger changes in 
Q can be made if the helix is also moved away from the sheet (i.e. h is increased). 

Four different supersecondary structure parameters useful for a/a proteins are shown in Figure 7. As 
for a/(3 parameters, the helix center is defined as the average C a position of the residues in the helix. 
30 The helix axis is defined as the principal moment of the C a atoms of the residues in the helix, o is 

defined as the rotation around the helix axis. Q is the angle between two strand axes after projection 
onto a plane. Thus, d, the distance between the helices, can be altered, generally as outlined above 
for h. Similarly, 9, o and Q can be altered as above. 

35 There are a number of different supersecondary structure parameters useful for p/p proteins. P-barrel 
configurations contain a number of different parameters that can be altered, as shown in Figure 6. 
These include: (see Figure 6A) R, the barrel radius; a, the angle of tilt of the strands relative to the 
barrel axis; and b, the interstrand distance; (see Figure 6B) 9, the mean twist of the P-sheet about an 
axis perpendicular to the strand direction; t, the mean twist of the P-sheet about an axis parallel to the 
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strand direction; e the mean coiling of the p-sheet along the strands; n, the mean coiling of the (3- 
sheet along a line perpendicular to the strands; and (Figure 6C) Q is angle between the two (3-sheet 
axes. As for the a/p parameter values, each of these may be altered, either positively or negatively. 
Generally, changes are made in at least one of these parameter values, by changing the angle degree 
5 (either positively or negatively) of up to about 25 degrees, with changes of + 1°, 2.5°, 5°, 7.5°, and 10° 
being particularly preferred, b can be changed up to + 1 A. For p sandwich structures (Figure 6C and 
6D), O can be altered up to + 45°, with changes of + 1°, 2.5°, 5°, 7.5°, and 10° being particularly 
preferred. Similarly, h can be altered as outlined above for a/P proteins, and 8 and <\> can be altered 
up to + 30°. 

10 

Once the desired value changes are selected, the coordinate positions for the positions chosen are 
altered to reflect the change, to form a "new" or "altered" backbone protein structure, i.e. one that has 
all or part of the backbone atoms in a different position relative to the starting structure. It should be 
noted that this process can be repeated, i.e. additional backbone changes can be made, on the same 
1 & or different residues. In addition, after optimization, the backbone of one or more optima! sequences 
can altered and an optimization can be run. 

Alternatively, movement of the backbone can be done manually, i.e. sections of the protein backbone 
can be randomly or arbitrarily moved, in this embodiment, the backbone atoms of one or more amino 
20 acids can be moved some distance, generally an angstrom or more, in any direction. This can be 

done using standard modeling programs; for example, Molecular Dynamics minimization, Monte Carlo 
; dynamics, or random backbone coordinate/angle motion. It is also possible to move the backbone 
atoms of single residues, that are either components of secondary structural elements or not. 

25 Once a protein structure backbone is generated (with alterations, as outlined above) and input into the 
computer, explicit hydrogens are added if not included within the structure (for example, if the structure 
was generated by X-ray crystallography, hydrogens must be added). After hydrogen addition, energy 
minimization of the structure is run, to relax the hydrogens as well as the other atoms, bond angles 
and bond lengths. In a preferred embodiment, this is done by doing a number of steps of conjugate 

30 gradient minimization (Mayo era/., J. Phys. Chem. 94:8897 (1990)) of atomic coordinate positions to 
minimize the Dreiding force field with no electrostatics. Generally from about 10 to about 250 steps is 
preferred, with about 50 being most preferred. 

The protein backbone structure contains at least one variable residue position. As is known in the art, 
35 the residues, or amino acids, of proteins are generally sequentially numbered starting with the N- 
terminus of the protein. Thus a protein having a methionine at it's N-terminus is said to have a 
methionine at residue or amino acid position 1, with the next residues as 2, 3, 4, etc. At each position, 
the wild type (i.e. naturally occurring) protein may have one of at least 20 amino acids, in any number 
of rotamers. By "variable residue position" herein is meant an amino acid position of the protein to be 
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designed that is not fixed in the design method as a specific residue or rotamer, generally the wild-type 
residue or rotamer. 

In a preferred embodiment, all of the residue positions of the protein are variable. That is, every amino 
5 acid side chain may be altered in the methods of the present invention. This is particularly desirable 
for smaller proteins, although the present methods allow the design of larger proteins as well. While 
there is no theoretical limit to the length of the protein which may be designed this way, there is a 
practical computational limit. 

10 In an alternate preferred embodiment, only some of the residue positions of the protein are variable, 
and the remainder are "fixed", that is, they are identified in the three dimensional structure as being in 
a set conformation. In some embodiments, a fixed position is left in its original conformation (which 
may or may not correlate to a specific rotamer of the rotamer library being used). Alternatively, 
residues may be fixed as a non-wild type residue; for example, when known site-directed mutagenesis 

15 techniques have shown that a particular residue is desirable (for example, to eliminate a proteolytic 
site or alter the substrate specificity of an enzyme), the residue may be fixed as a particular amino 
acid. Alternatively, the methods of the present invention may be used to evaluate mutations de novo, 
as is discussed below. In an alternate preferred embodiment, a fixed position may be "floated"; the 
amino acid at that position is fixed, but different rotamers of that amino acid are tested. In this 

20 embodiment, the variable residues may be at least one, or anywhere from 0.1% to 99.9% of the total 
number of residues. Thus, for example, it may be possible to change only a few (or one) residues, or 
: most of the residues, with all possibilities in between. 

In a preferred embodiment, residues which can be fixed include, but are not limited to, structurally or 
25 biologically functional residues. For example, residues which are known to be important for biological 
activity, such as the residues which form the active site of an enzyme, the substrate binding site of an 
enzyme, the binding site for a binding partner (ligand/receptor, antigen/antibody, etc.), phosphorylation 
or glycosylation sites which are crucial to biological function, or structurally important residues, such as 
disulfide bridges, metal binding sites, critical hydrogen bonding residues, residues critical for backbone 
30 conformation such as proline or glycine, residues critical for packing interactions, etc. may all be fixed 
in a conformation or as a single rotamer, or "floated". 

Similarly, residues which may be chosen as variable residues may be those that confer undesirable 
biological attributes, such as susceptibility to proteolytic degradation, dimerization or aggregation sites, 
35 glycosylation sites which may lead to immune responses, unwanted binding activity, unwanted 
allostery, undesirable enzyme activity but with a preservation of binding, etc. 

As will be appreciated by those in the art, the methods of the present invention allow computational 
testing of "site-directed mutagenesis" targets without actually making the mutants, or prior to making 
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the mutants. That is, quick analysis of sequences in which a small number of residues are changed 
can be done to evaluate whether a proposed change is desirable. In addition, this may be done on a 
known protein, or on an protein optimized as described herein. 

5 As will be appreciated by those in the art, a domain of a larger protein may essentially be treated as a 
small independent protein; that is, a structural or functional domain of a large protein may have 
minimal interactions with the remainder of the protein and may essentially be treated as if it were 
autonomous. In this embodiment, all or part of the residues of the domain may be variable. 

10 It should be noted that even if a position is chosen as a variable position, it is possible that the 

methods of the invention will optimize the sequence in such a way as to select the wild type residue at 
the variable position. This generally occurs more frequently for core residues, and less regularly for 
surface residues. In addition, it is possible to fix residues as non-wild type amino acids as well. 

1|h Once the protein backbone structure has been selected and input, and the variable residue positions 
\0 chosen, a group of potential rotamers for each of the variable residue positions is established. This 
" f operation is shown as step 52 in Figure 2. This step may be implemented using the side chain module 
rj 32. In one embodiment of the invention, the side chain module 32 includes at least one rotamer 
^ library, as described below, and program code that correlates the selected protein backbone structure 

20 with corresponding information in the rotamer library. Alternatively, the side chain module 32 may be 
s omitted and the potential rotamers 42 for the selected protein backbone structure may be downloaded 
f* through the input/output devices 26. 

H ; As is known in the art, each amino acid side chain has a set of possible conformers, called rotamers. 

2p* See Ponder, era/., Acad. Press Inc. (London) Ltd. pp. 775-791 (1987); Dunbrack, era/., Struc. Biol. 
1(5):334-340 (1994); Desmet, era/., Nature 356:539-542 (1992), all of which are hereby expressly 
incorporated by reference in their entirety. Thus, a set of discrete rotamers for every amino acid side 
chain is used. There are two general types of rotamer libraries: backbone dependent and backbone 
independent. A backbone dependent rotamer library allows different rotamers depending on the 

30 position of the residue in the backbone; thus for example, certain leucine rotamers are allowed if the 
position is within an a helix, and different leucine rotamers are allowed if the position is not in a a- 
helix. A backbone independent rotamer library utilizes all rotamers of an amino acid at every position. 
In general, a backbone independent library is preferred in the consideration of core residues, since 
flexibility in the core is important. However, backbone independent libraries are computationally more 

35 expensive, and thus for surface and boundary positions, a backbone dependent library is preferred. 
However, either type of library can be used at any position. 

In addition, a preferred embodiment does a type of "fine tuning" of the rotamer library by expanding the 
possible x (chi) angle values of the rotamers by plus and minus one standard deviation (or more) 
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about the mean value, in order to minimize possible errors that might arise from the discreteness of 
the library. This is particularly important for aromatic residues, and fairly important for hydrophobic 
residues, due to the increased requirements for flexibility in the core and the rigidity of aromatic rings; 
it is not as important for the other residues. Thus a preferred embodiment expands the Xi and x 2 
5 angles for all amino acids except Met, Arg and Lys. 

To roughly illustrate the numbers of rotamers, in one version of the Dunbrack & Karplus backbone- 
dependent rotamer library, alanine has 1 rotamer, glycine has 1 rotamer, arginine has 55 rotamers, 
threonine has 9 rotamers, lysine has 57 rotamers, glutamic acid has 69 rotamers, asparagine has 54 
1 0 rotamers, aspartic acid has 27 rotamers, tryptophan has 54 rotamers, tyrosine has 36 rotamers, 

cysteine has 9 rotamers, glutamine has 69 rotamers, histidine has 54 rotamers, valine has 9 rotamers, 
isoleucine has 45 rotamers, leucine has 36 rotamers, methionine has 21 rotamers, serine has 9 
rotamers, and phenylalanine has 36 rotamers. 

15 In general, proline is not generally used, since it will rarely be chosen for any position, although it can 
be included if desired. Similarly, a preferred embodiment omits cysteine as a consideration, only to 
avoid potential disulfide problems, although it can be included if desired. 

As will be appreciated by those in the art, other rotamer libraries with all dihedral angles staggered can 
20 be used or generated. 

In a preferred embodiment, at a minimum, at least one variable position has rotamers from at least 
two different amino acid side chains; that is, a sequence is being optimized, rather than a structure. 

25 In a preferred embodiment, rotamers from all of the amino acids (or all of them except cysteine, 

glycine and proline) are used for each variable residue position; that is, the group or set of potential 
rotamers at each variable position is every possible rotamer of each amino acid. This is especially 
preferred when the number of variable positions is not high as this type of analysis can be 
computationally expensive. 

30 

In a preferred embodiment, each variable position is classified as either a core, surface or boundary 
residue position, although in some cases, as explained below, the variable position may be set to 
glycine to minimize backbone strain. 

35 It should be understood that quantitative protein design or optimization studies prior to the present 
invention focused almost exclusively on core residues. The present invention, however, provides 
methods for designing proteins containing core, surface and boundary positions. Alternate 
embodiments utilize methods for designing proteins containing core and surface residues, core and 
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boundary residues, and surface and boundary residues, as well as core residues alone (using the 
scoring functions of the present invention), surface residues alone, or boundary residues alone. 

The classification of residue positions as core, surface or boundary may be done in several ways, as 
5 will be appreciated by those in the art. In a preferred embodiment, the classification is done via a 
visual scan of the original protein backbone structure, including the side chains, and assigning a 
classification based on a subjective evaluation of one skilled in the art of protein modelling. 
Alternatively, a preferred embodiment utilizes an assessment of the orientation of the Ca-CP vectors 
relative to a solvent accessible surface computed using only the template Ca atoms. In a preferred 

10 embodiment, the solvent accessible surface for only the Ca atoms of the target fold is generated 

using the Connolly algorithm with a add-on radius ranging from about 4 to about 12A, with from about 
6 to about 10A being preferred, and 8 A being particularly preferred. The Ca radius used ranges from 
about 1.6A to about 2.3A, with from about 1.8 to about 2.1 A being preferred, and 1.95 A being 
especially preferred. A residue is classified as a core position if a) the distance for its Ca, along its 

1 5 Ca-CP vector, to the solvent accessible surface is greater than about 4-6 A, with greater than about 
5.0 A being especially preferred, and b) the distance for its CP to the nearest surface point is greater 
than about 1 .5-3 A, with greater than about 2.0 A being especially preferred. The remaining residues 
are classified as surface positions if the sum of the distances from their Ca, along their Ca-CP vector 
, to the solvent accessible surface, plus the distance from their CP to the closest surface point was 

20 less than about 2.5-4 A, with less than about 2.7 A being especially preferred. All remaining residues 
are classified as boundary positions. 

Once each variable position is classified as either core, surface or boundary, a set of amino acid side 
chains, and thus a set of rotamers, is assigned to each position. That is, the set of possible amino acid 

25 side chains that the program will allow to be considered at any particular position is chosen. 

Subsequently, once the possible amino acid side chains are chosen, the set of rotamers that will be 
evaluated at a particular position can be determined. Thus, a core residue will generally be selected 
from the group of hydrophobic residues consisting of alanine, valine, isoleucine, leucine, 
phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the a scaling factor 

30 of the van der Waals scoring function, described below, is low, methionine is removed from the set), 

and the rotamer set for each core position potentially includes rotamers for these eight amino acid side 
chains (all the rotamers if a backbone independent library is used, and subsets if a rotamer dependent 
backbone is used). Similarly, surface positions are generally selected from the group of hydrophilic 
residues consisting of alanine, serine, threonine, aspartic acid, asparagine, glutamine, glutamic acid, 

35 arginine, lysine and histidine. The rotamer set for each surface position thus includes rotamers for 
these ten residues. Finally, boundary positions are generally chosen from alanine, serine, threonine, 
aspartic acid, asparagine, glutamine, glutamic acid, arginine, lysine histidine, valine, isoleucine, 
leucine, phenylalanine, tyrosine, tryptophan, and methionine. The rotamer set for each boundary 
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position thus potentially includes every rotamer for these seventeen residues (assuming cysteine, 
glycine and proline are not used, although they can be). 

Thus, as will be appreciated by those in the art, there is a computational benefit to classifying the 
residue positions, as it decreases the number of calculations. It should also be noted that there may 
be situations where the sets of core, boundary and surface residues are altered from those described 
above; for example, under some circumstances, one or more amino acids is either added or 
subtracted from the set of allowed amino acids. For example, some proteins which dimerize or 
multimerize, or have ligand binding sites, may contain hydrophobic surface residues, etc. In addition, 
residues that do not allow helix "capping" or the favorable interaction with an a-helix dipole may be 
subtracted from a set of allowed residues. This modification of amino acid groups is done on a 
residue by residue basis. 

In a preferred embodiment, proline, cysteine and glycine are not included in the list of possible amino 
acid side chains, and thus the rotamers for these side chains are not used. However, in a preferred 
embodiment, when the variable residue position has a $ angle (that is, the dihedral angle defined by 1) 
the carbonyl carbon of the preceding amino acid; 2) the nitrogen atom of the current residue; 3) the a- 
carbon of the current residue; and 4) the carbonyl carbon of the current residue) greater than 0°, the 
position is set to glycine to minimize backbone strain. 

Once the group of potential rotamers is assigned for each variable residue position, processing 
proceeds to step 54 of Figure 2. This processing step entails analyzing interactions of the rotamers 
with each other and with the protein backbone to generate optimized protein sequences. The ranking 
module 34 may be used to perform these operations. That is, computer code is written to implement 
the following functions. Simplistically, as is generally outlined above, the processing initially 
comprises the use of a number of scoring functions, described below, to calculate energies of 
interactions of the rotamers, either to the backbone itself or other rotamers. 

The scoring functions include a Van der Waals potential scoring function, a hydrogen bond potential 
scoring function, an atomic solvation scoring function, a secondary structure propensity scoring 
function and an electrostatic scoring function. As is further described below, at least one scoring 
function is used to score each position, although the scoring functions may differ depending on the 
position classification or other considerations, like favorable interaction with an a-helix dipole. As 
outlined below, the total energy which is used in the calculations is the sum of the energy of each 
scoring function used at a particular position, as is generally shown in Equation 1: 

Equation 1 

E lotal = nE vdw + nE as + nE h . bond , ng + nE ss + nE elec 
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in Equation 1 , the total energy is the sum of the energy of the van der Waals potential (E vdw ), the 
energy of atomic solvation (E as ), the energy of hydrogen bonding (E h _ bonding ), the energy of secondary 
structure (E ss ) and the energy of electrostatic interaction (E elec ). The term n is either 0 or 1 , depending 
on whether the term is to be considered for the particular residue position, as is more fully outlined 
below. 

In a preferred embodiment, a van der Waals' scoring function is used. As is known in the art, van der 
Waals' forces are the weak, non-covalent and non-ionic forces between atoms and molecules, that is, 
the induced dipole and electron repulsion (Pauli principle) forces. 

The van der Waals scoring function is based on a van der Waals potential energy. There are a 
number of van der Waals potential energy calculations, including a Lennard-Jones 12/6 potential with 
radii and well depth parameters from the Dreiding force field, Mayo et at., J. Prot. Chem., 1990, 
expressly incorporated herein by reference, or the exponential 6 potential. Equation 2, shown below, 
is the preferred Lennard-Jones potential: 

Equation 2 

R 0 is the geometric mean of the van der Waals radii of the two atoms under consideration, and D 0 is 
the geometric mean of the well depth of the two atoms under consideration. E vdw and R are the energy 
and interatomic distance between the two atoms under consideration, as is more fully described 
below. 

In a preferred embodiment, the van der Waals forces are scaled using a scaling factor, a. Equation 3 
shows the use of a in the van der Waals Lennard-Jones potential equation: 



r ; ^ r 

The role of the a scaling factor is to change the importance of packing effects in the optimization and 
design of any particular protein. Specifically, a reduced van der Waals steric constraint can 
compensate for the restrictive effect of a fixed backbone and discrete side-chain rotamers in the 
simulation and can allow a broader sampling of sequences compatible with a desired fold. In a 
preferred embodiment, a values ranging from about 0.70 to about 1.10 can be used, with a values 
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from about 0.8 to about 1.05 being preferred, and from about 0.85 to about 1.0 being especially 
preferred. Specific a values which are preferred are 0.80, 0.85, 0.90, 0.95, 1.00, and 1.05. 

Generally speaking, variation of the van der Waals scale factor a results in four regimes of packing 
5 specificity: regime 1 where 0.9 < a < 1.05 and packing constraints dominate the sequence selection; 
regime 2 where 0.8 < a < 0.9 and the hydrophobic solvation potential begins to compete with packing 
forces; regime 3 where a < 0.8 and hydrophobic solvation dominates the design; and, regime 4 where 
a > 1 .05 and van der Waals repulsions appear to be too severe to allow meaningful sequence 
selection. In particular, different a values may be used for core, surface and boundary positions, with 
10 regimes 1 and 2 being preferred for core residues, regime 1 being preferred for surface residues, and 
regime 1 and 2 being preferred for boundary residues. 

In a preferred embodiment, the van der Waals scaling factor is used in the total energy calculations for 
each variable residue position, including core, surface and boundary positions. 

15_- 

In a preferred embodiment, an atomic solvation potential scoring function is used. As is appreciated 
by those in the art, solvent interactions of a protein are a significant factor in protein stability, and 
residue/protein hydrophobicity has been shown to be the major driving force in protein folding. Thus, 
there is an entropic cost to solvating hydrophobic surfaces, in addition to the potential for misfolding or 

20 aggregation. Accordingly, the burial of hydrophobic surfaces within a protein structure is beneficial to 
both folding and stability. Similarly, there can be a disadvantage for burying hydrophilic residues. The 
accessible surface area of a protein atom is generally defined as the area of the surface over which a 
water molecule can be placed while making van der Waals contact with this atom and not penetrating 
- any other protein atom. Thus, in a preferred embodiment, the solvation potential is generally scored 

25 by taking the total possible exposed surface area of the moiety or two independent moieties (either a 
rotamer or the first rotamer and the second rotamer), which is the reference, and subtracting out the 
"buried" area, i.e. the area which is not solvent exposed due to interactions either with the backbone or 
with other rotamers. This thus gives the exposed surface area. 

30 Alternatively, a preferred embodiment calculates the scoring function on the basis of the "buried" 

portion; i.e. the total possible exposed surface area is calculated, and then the calculated surface area 
after the interaction of the moieties is subtracted, leaving the buried surface area. A particularly 
preferred method does both of these calculations. 

35 As is more fully described below, both of these methods can be done in a variety of ways. See 

Eisenberg era/.. Nature 319:199-203 (1986); Connolly. Science 221:709-713 (1983); and Wodak, ef 
a/., Proc. Natl. Acad. Sci. USA 77(4): 1736-1 740 (1980), all of which are expressly incorporated herein 
by reference. As will be appreciated by those in the art, this solvation potential scoring function is 
conformation dependent, rather than conformation independent. 
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In a preferred embodiment, the pairwise solvation potential is implemented in two components, 
"singles" (rotamer/template) and "doubles" (rotamer/rotamer), as is more fully described below. For 
the rotamer/template buried area, the reference state is defined as the rotamer in question at residue 
position i with the backbone atoms only of residues i-1 , i and i+1 , although in some instances just i 
5 may be used. Thus, in a preferred embodiment, the solvation potential is not calculated for the 

interaction of each backbone atom with a particular rotamer, although more may be done as required. 
The area of the side chain is calculated with the backbone atoms excluding solvent but not counted in 
the area. The folded state is defined as the area of the rotamer in question at residue i, but now in the 
context of the entire template structure including non-optimized side chains, i.e. every other fixed 

10 position residue. The rotamer/template buried area is the difference between the reference and the 
folded states. The rotamer/rotamer reference area can be done in two ways; one by using simply the 
sum of the areas of the isolated rotamers; the second includes the full backbone. The folded state is 
the area of the two rotamers placed in their relative positions on the protein scaffold but with no 
template atoms present. In a preferred embodiment, the Richards definition of solvent accessible 

15 surface area (Lee and Richards, J. Mol. Biol. 55:379-400, 1971, hereby incorporated by reference) is 
used, with a probe radius ranging from 0.8 to 1 .6 A, with 1 .4 A being preferred, and Drieding van der 
Waals radii, scaled from 0.8 to 1.0. Carbon and sulfur, and all attached hydrogens, are considered 
nonpolar. Nitrogen and oxygen, and all attached hydrogens, are considered polar. Surface areas are 
calculated with the Connolly algorithm using a dot density of 10 A-2 (Connolly, (1983) (supra), hereby 

20 incorporated by reference). 

In a preferred embodiment, there is a correction for a possible overestimation of buried surface area 
which may exist in the calculation of the energy of interaction between two rotamers (but not the 
interaction of a rotamer with the backbone). Since, as is generally outlined below, rotamers are only 
25 considered in pairs, that is, a first rotamer is only compared to a second rotamer during the "doubles" 
calculations, this may overestimate the amount of buried surface area in locations where more than 
two rotamers interact, that is, where rotamers from three or more residue positions come together. 
Thus, a correction or scaling factor is used as outlined below. 

30 The general energy of solvation is shown in Equation 4: 

Equation 4 
E sa = f(SA) 

35 where E sa is the energy of solvation, f is a constant used to correlate surface area and energy, and SA 
is the surface area. This equation can be broken down, depending on which parameter is being 
evaluated. Thus, when the hydrophobic buried surface area is used, Equation 5 is appropriate: 
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Equation 5 

E sa = fi(SA bur , edhydrophob , c ) 

where f, is a constant which ranges from about 10 to about 50 cal/mol/A 2 , with 23 or 26 cal/mol/A 2 
5 being preferred. When a penalty for hydrophilic burial is being considered, the equation is shown in 
Equation 6: 

Equation 6 

^sa = fi(SA bur|edhydrophoblc ) + f2(SA buned hydroph| | |C ) 

10 

where f 2 is a constant which ranges from -50 to -250 cal/mol/A 2 , with -86 or -100 cal/mol/A 2 being 
preferred. Similarly, if a penalty for hydrophobic exposure is used, equation 7 or 8 may be used: 

Equation 7 

1 5 E sa = (SA buried hydrophobic) + ^(^A^posed hydrophobic) 

Equation 8 

^sa = fl(SA bur|edhydropriob|C ) + f2(SA bur|ed hydrophilic) + f3(SA exposed hydrophobic) + ^4(SA exposed hydrophilic) 

20 In a preferred embodiment, f 3 = -f,. 

In one embodiment, backbone atoms are not included in the calculation of surface areas, and values 
of 23 cal/mol/A 2 (f,) and -86 cal/mol/A 2 (f 2 ) are determined. 

25 In a preferred embodiment, this overcounting problem is addressed using a scaling factor that 

compensates for only the portion of the expression for pairwise area that is subject to over-counting. In 
this embodiment, values of -26 cal/mol/A 2 (f,) and 100 cal/mol/A 2 (f 2 ) are determined. 

Atomic solvation energy is expensive, in terms of computational time and resources. Accordingly, in a 
30 preferred embodiment, the solvation energy is calculated for core and/or boundary residues, but not 

surface residues, with both a calculation for core and boundary residues being preferred, although any 
combination of the three is possible. 

In a preferred embodiment, an atomic solvation potential function modified to generate only singles 
35 energies is used. This one-body approximation to the atomic solvation potential can be computed in 
several ways including, but not limited to, using Monte Carlo (MC), mean-field, and approximate dead- 
end elimination methods. In a preferred embodiment, each rotamer allowed in the design calculation 
is subjected to a MC simulation such that the rotamer of interest is forced to be in the final solution. 
This solution is then used to determine the buried and exposed surface areas for the rotamer in 
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question. The one-body atomic solvation potential is computed using these surface areas and is 
added to the other singles energies for the rotamer in question. This method completely replaces the 
need to compute rotamer-backbone and rotamer-rotamer solvation energies. Computing rotamer 
solvation energies in this way allows all of a rotamer's solvation energy to be included in its singles 
energy at the cost of using a potentially inaccurate representation of the protein surface in the region 
of the rotamer (that is, a surface that is generated by a MC simulation). 

In a preferred embodiment, a hydrogen bond potential scoring function is used. A hydrogen bond 
potential is used as predicted hydrogen bonds do contribute to designed protein stability (see Stickle et 
a/., J. Mol. Biol. 226:1143 (1992); Huyghues-Despointes etal., Biochem. 34:13267 (1995), both of 
which are expressly incorporated herein by reference). As outlined previously, explicit hydrogens are 
generated on the protein backbone structure. 

In a preferred embodiment, the hydrogen bond potential consists of a distance-dependent term and an 
angle-dependent term, as shown in Equation 9: 



where R 0 (2.8 A) and D 0 (8 kcal/mol) are the hydrogen-bond equilibrium distance and well-depth, 
respectively, and R is the donor to acceptor distance. This hydrogen bond potential is based on the 
potential used in DREIDING with more restrictive angle-dependent terms to limit the occurrence of 
unfavorable hydrogen bond geometries. The angle term varies depending on the hybridization state of 
the donor and acceptor, as shown in Equations 10, 11, 12 and 13. Equation 10 is used forsp 3 donor 
to sp 3 acceptor; Equation 1 1 is used for sp 3 donor to sp 2 acceptor, Equation 12 is used for sp 2 donor to 
sp 3 acceptor, and Equation 13 is used for sp 2 donor to sp 2 acceptor: 



Equation 9 




Equation 10 



F = cos 2 



5 2 6cos 2 (<i> -109.5) 



Equation 1 1 



F = cos 2 



3 2 6 cos 2 (J> 



Equation 12 
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F = cos 4 e 



Equation 13 
F = cos 2 0 cos 2 (max [(J) , (p ]) 

In Equations 10-13, 9 is the donor-hydrogen-acceptor angle, 4> is the hydrogen-acceptor-base angle 
(the base is the atom attached to the acceptor, for example the carbonyl carbon is the base for a 
carbonyl oxygen acceptor), and (p is the angle between the normals of the planes defined by the six 
5 atoms attached to the sp 2 centers (the supplement of cp is used when <p is less than 90°). The 

hydrogen-bond function is only evaluated when 2.6 A < R < 3.2 A, 9 > 90°, 4> - 109.5° < 90° for the sp 3 
donor - sp 3 acceptor case, and, <\> > 90° for the sp 3 donor - sp 2 acceptor case; preferably, no switching 
functions are used. Template donors and acceptors that are involved in template-template hydrogen 
bonds are preferably not included in the donor and acceptor lists. For the purpose of exclusion, a 
JO template-template hydrogen bond is considered to exist when 2.5 A < R < 3.3 A and 9 > 135°. 

The hydrogen-bond potential may also be combined or used with a weak coulombic term that includes 
. a distance-dependent dielectric constant of 40R, where R is the interatomic distance. Partial atomic 

charges are preferably only applied to polar functional groups. A net formal charge of +1 is used for 
1 5 Arg and Lys and a net formal charge of -1 is used for Asp and Glu; see Gasteiger, et al. , Tetrahedron 

36:3219-3288 (1980); Rappe, ef al.. J. Phvs. Chem. 95:3358-3363 (1991). 

In a preferred embodiment, an explicit penalty is given for buried polar hydrogen atoms which are not 
hydrogen bonded to another atom. See Eisenberg, ef al., (1986) (supra), hereby expressly 

20 incorporated by reference. In a preferred embodiment, this penalty for polar hydrogen burial, is from 
about 0 to about 3 kcal/mol, with from about 1 to about 3 being preferred and 2 kcal/mol being 
particularly preferred. This penalty is only applied to buried polar hydrogens not involved in hydrogen 
bonds. A hydrogen bond is considered to exist when E HB ranges from about 1 to about 4 kcal/mol, 
with E HB of less than -2 kcal/mol being preferred. In addition, in a preferred embodiment, the penalty is 

25 not applied to template hydrogens, i.e. unpaired buried hydrogens of the backbone. 

In a preferred embodiment, only hydrogen bonds between a first rotamerand the backbone are 
scored, and rotamer-rotamer hydrogen bonds are not scored. In an alternative embodiment, hydrogen 
bonds between a first rotamerand the backbone are scored, and rotamer-rotamer hydrogen bonds 
30 are scaled by 0.5 . 
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In a preferred embodiment, the hydrogen bonding scoring function is used for all positions, including 
core, surface and boundary positions. In alternate embodiments, the hydrogen bonding scoring 
function may be used on only one or two of these. 

In a preferred embodiment, a secondary structure propensity scoring function is used. This is based 
on the specific amino acid side chain, and is conformation independent. That is, each amino acid has 
a certain propensity to take on a secondary structure, either cc-helix or (3-sheet, based on its (J> and ip 
angles. See Munoz etal., Current Op. in Biotech. 6:382 (1995); Minor, etal., Nature 367:660-663 
(1994); Padmanabhan, etal., Nature 344:268-270 (1990); Munoz, etal.. Folding & Design K3V167- 
178 (1996); and Chakrabartty, etal., Protein Sci. 3:843 (1994), all of which are expressly incorporated 
herein by reference. Thus, for variable residue positions that are in recognizable secondary structure 
in the backbone, a secondary structure propensity scoring function is preferably used. That is, when a 
variable residue position is in an a-helical area of the backbone, the a-helical propensity scoring 
function described below is calculated. Whether or not a position is in a a-helical area of the 
backbone is determined as will be appreciated by those in the art, generally on the basis of (J) and ijj 
angles; for oc-helix, $ angles from -2 to -70 and ip angles from -30 to -100 generally describe an a- 
helical area of the backbone. 

Similarly, when a variable residue position is in a (3-sheet backbone conformation, the (3-sheet 
propensity scoring function is used. (3-sheet backbone conformation is generally described by (J) 
angles from -30 to -100 and x angles from +40 to +180. In alternate preferred embodiments, variable 
residue positions which are within areas of the backbone which are not assignable to either (3-sheet or 
a-helix structure may also be subjected to secondary structure propensity calculations. 

In a preferred embodiment, energies associated with secondary propensities are calculated using 
Equation 14: 

Equation 14 
E = io N ^ (AG °^ AGOa]a) - 1 

In Equation 14, E a (or E p ) is the energy of a-helical propensity, AG° aa is the standard free energy of 
helix propagation of the amino acid, and AG° ala is the standard free energy of helix propagation of 
alanine used as a standard, or standard free energy of [3-sheet formation of the amino acid, both of 
which are available in the literature (see Chakrabartty, et al., (1994) (supra), and Munoz, et a/., (1996) 
(supra)), both of which are expressly incorporated herein by reference), and N ss is the propensity scale 
factor which is set to range from 1 to 4, with 2.0 being preferred. This potential is preferably selected 
in order to scale the propensity energies to a similar range as the other terms in the scoring function. 
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In a preferred embodiment, (3-sheet propensities are preferably calculated only where the i-1 and i+1 
residues are also in P-sheet conformation. 

In a preferred embodiment, the secondary structure propensity scoring function is used only in the 
energy calculations for surface variable residue positions. In alternate embodiments, the secondary 
structure propensity scoring function is used in the calculations for core and boundary regions as well. 

In a preferred embodiment, an electrostatic scoring function is used, as shown below in Equation 15: 



In this Equation, q is the charge on atom 1 , q' is charge on atom 2, and r is the interaction distance. 

In a preferred embodiment, a rotamer internal strain energy potential scoring function is used. As 
appreciated by those of skill in the art, for a given rotamer, a number of different internal interactions 

may occur between different atoms that contribute to designed protein stability. The contribution of 

these internal interactions may be determined by using one or more of the following scoring functions: 
1) van der Waals scoring function; 2) Coulumbic potential; 3) hydrogen bond potential; and d) torsional 
potential. 

In a preferred embodiment, at least one scoring function is used for each variable residue position; in 
preferred embodiments, two, three or four scoring functions are used for each variable residue 
position. 

Once the scoring functions to be used are identified for each variable position, the preferred first step 
in the computational analysis comprises the determination of the interaction of each possible rotamer 
with all or part of the remainder of the protein. That is, the energy of interaction, as measured by one 
or more of the scoring functions, of each possible rotamer at each variable residue position with either 
the backbone or other rotamers, is calculated. In a preferred embodiment, the interaction of each 
rotamer with the entire remainder of the protein, i.e. both the entire template and all other rotamers, is 
done. However, as outlined above, it is possible to only model a portion of a protein, for example a 
domain of a larger protein, and thus in some cases, not all of the protein need be considered. 

In a preferred embodiment, the first step of the computational processing is done by calculating two 
sets of interactions for each rotamer at every position (Figure 3A): the interaction of the rotamer side 
chain with the template or backbone (the "singles" energy), and the interaction of the rotamer side 
chain with all other possible rotamers at every other position (the "doubles" energy), whether that 
position is varied or floated. It should be understood that the backbone in this case includes both the 
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atoms of the protein structure backbone, as well as the atoms of any fixed residues, wherein the fixed 
residues are defined as a particular conformation of an amino acid. 

Thus, "singles" (rotamer/template) energies are calculated for the interaction of every possible rotamer 
at every variable residue position with the backbone, using some or all of the scoring functions. Thus, 
for the hydrogen bonding scoring function, every hydrogen bonding atom of the rotamer and every 
hydrogen bonding atom of the backbone is evaluated, and the E HB is calculated for each possible 
rotamer at every variable position. Similarly, for the van der Waals scoring function, every atom of the 
rotamer is compared to every atom of the template (generally excluding the backbone atoms of its 
own residue), and the E vdW is calculated for each possible rotamer at every variable residue position. 
In addition, generally no van der Waals energy is calculated if the atoms are connected by three bonds 
or less. For the atomic solvation scoring function, the surface of the rotamer is measured against the 
surface of the template, and the E as for each possible rotamer at every variable residue position is 
calculated. The secondary structure propensity scoring function is also considered as a singles 
energy, and thus the total singles energy may contain an E SB term. As will be appreciated by those in 
the art, many of these energy terms will be close to zero, depending on the physical distance between 
the rotamer and the template position; that is, the farther apart the two moieties, the closer to zero. 

Accordingly, as outlined above, the total singles energy is the sum of the energy of each scoring 
function used at a particular position, as shown in Equation 1, wherein n is either 1 or zero, depending 
on whether that particular scoring function was used at the rotamer position: 

Equation 1 

E total = nE vdw + nE as + nE h _ bonding + nE ss + nE elec 

Once calculated, each singles E tota | for each possible rotamer is stored in the memory 24 within the 
computer, such that it may be used in subsequent calculations, as outlined below. 

For the calculation of "doubles" energy (rotamer/rotamer), the interaction energy of each possible 
rotamer is compared with every possible rotamer at all other variable residue positions. Thus, 
"doubles" energies are calculated for the interaction of every possible rotamer at every variable 
residue position with every possible rotamer at every other variable residue position, using some or all 
of the scoring functions. Thus, for the hydrogen bonding scoring function, every hydrogen bonding 
atom of the first rotamer and every hydrogen bonding atom of every possible second rotamer is 
evaluated, and the E HB is calculated for each possible rotamer pair for any two variable positions. 
Similarly, for the van der Waals scoring function, every atom of the first rotamer is compared to every 
atom of every possible second rotamer, and the E vdW is calculated for each possible rotamer pair at 
every two variable residue positions. For the atomic solvation scoring function, the surface of the first 
rotamer is measured against the surface of every possible second rotamer, and the E as for each 
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possible rotamer pair at every two variable residue positions is calculated. The secondary structure 
propensity scoring function need not be run as a "doubles" energy, as it is considered as a component 
of the "singles" energy. As will be appreciated by those in the art, many of these double energy terms 
will be close to zero, depending on the physical distance between the first rotamer and the second 
5 rotamer; that is, the farther apart the two moieties, the closer to zero. 

Accordingly, as outlined above, the total doubles energy is the sum of the energy of each scoring 
function used to evaluate every possible pair of rotamers, as shown in Equation 16, wherein n is either 
1 or zero, depending on whether that particular scoring function was used at the rotamer position: 

10 

Equation 16 

E to t a i = nE vdw + nE as + nE h . bonding + E eiec 

An example is illuminating. A first variable position, i, has three (an unrealistically low number) 
1 5 possible rotamers (which may be either from a single amino acid or different amino acids) which are 
labelled i a , i b , and i c . A second variable position, j, also has three possible rotamers, labelled j d , j e , and 
j f . Thus, nine doubles energies (E total ) are calculated in all: E total (i a , j d ), E tolal (i a , je ), E total (i a , j f ), E total (i b , j d ), 
E total (i b , j e ), E total (i b , j f ), E total (i c , j d ), E total (i c , j e ), and E tota ,(i c , j f ). 

20 Once calculated, each doubles E total for each possible rotamer pair is stored in memory 24 within the 
computer, such that it may be used in subsequent calculations, as outlined below. 

Once the singles and doubles energies are calculated and stored, the next step of the computational 
processing may occur. Generally speaking, the goal of the computational processing is to determine a 
25 set of optimized protein sequences. By "optimized protein sequence" herein is meant a sequence 
that best fits the mathematical equations herein. As will be appreciated by those in the art, a global 
optimized sequence is the one sequence that best fits Equation 1 , i.e. the sequence that has the 
lowest energy of any possible sequence. However, there are any number of sequences that are not 
the global minimum but that have low energies. 

30 

In a preferred embodiment, the set comprises the globally optimal sequence in its optimal 
conformation, i.e. the optimum rotamer at each variable position. That is, computational processing is 
run until the simulation program converges on a single sequence which is the global optimum. 

35 In a preferred embodiment, the set comprises at least two optimized protein sequences. Thus for 

example, the computational processing step may eliminate a number of disfavored combinations but 
be stopped prior to convergence, providing a set of sequences of which the global optimum is one. In 
addition, further computational analysis, for example using a different method, may be run on the set, 
to further eliminate sequences or rank them differently. Alternatively, as is more fully described below, 
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the global optimum may be reached, and then further computational processing may occur, which 
generates additional optimized sequences in the neighborhood of the global optimum. 



If a set comprising more than one optimized protein sequences is generated, they may be rank 
ordered in terms of theoretical quantitative stability, as is more fully described below. 

In a preferred embodiment, the computational processing step first comprises an elimination step, 
sometimes referred to as "applying a cutoff', either a singles elimination or a doubles elimination. 
Singles elimination comprises the elimination of all rotamers with template interaction energies of 
greater than about 10 kcal/mol prior to any computation, with elimination energies of greater than 
about 15 kcal/mol being preferred and greater than about 25 kcal/mol being especially preferred. 
Similarly, doubles elimination is done when a rotamer has interaction energies greater than about 10 
kcal/mol with all rotamers at a second residue position, with energies greater than about 15 being 
preferred and greater than about 25 kcal/mol being especially preferred. 

In a preferred embodiment, the computational processing comprises direct determination of total 
sequence energies, followed by comparison of the total sequence energies to ascertain the global 
optimum and rank order the other possible sequences, if desired. The energy of a total sequence is 
shown below in Equation 17: 

Equation 17 

-^total protein = ^(b-b) + X) + X/ ^(i a ,j a ) 

all i all i all j paiis 

Thus every possible combination of rotamers may be directly evaluated by adding the backbone- 
backbone (sometimes referred to herein as template-template) energy (E (b _ b) which is constant over all 
sequences herein since the backbone is kept constant), the singles energy for each rotamer (which 
has already been calculated and stored), and the doubles energy for each rotamer pair (which has 
already been calculated and stored). Each total sequence energy of each possible rotamer sequence 
can then be ranked, either from best to worst or worst to best. This is obviously computationally 
expensive and becomes unwieldy as the length of the protein increases. 

In a preferred embodiment, the computational processing includes one or more Dead-End Elimination 
(DEE) computational steps. The DEE theorem is the basis for a very fast discrete search program 
that was designed to pack protein side chains on a fixed backbone with a known sequence. See 
Desmet, etal., Nature 356:539-542 (1992); Desmet, era/., The Proetin Folding Problem and Tertiary 
Structure Prediction , Ch. 10:1-49 (1994); Goldstein, Biophvs. Jour. 66:1335-1340 (1994), all of which 
are incorporated herein by reference. DEE is based on the observation that if a rotamer can be 
eliminated from consideration at a particular position, i.e. make a determination that a particular 
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rotamer is definitely not part of the global optimal conformation, the size of the search is reduced. This 
is done by comparing the worst interaction (i.e. energy or E total ) of a first rotamer at a single variable 
position with the best interaction of a second rotamer at the same variable position. If the worst 
interaction of the first rotamer is still better than the best interaction of the second rotamer, then the 
5 second rotamer cannot possibly be in the optimal conformation of the sequence. The original DEE 
theorem is shown in Equation 23: 

Equation 23 

E(i a ) + B min over t{E(i a , j t )}] > E(i b ) + £[max over t{E(i b , j t )}] 
10 j j 

In Equation 23, rotamer i a is being compared to rotamer i b (for notational convenience i a , and i b are the 
equivalents of i r , i t , respectively, in Equation 35). The left side of the inequality is the best possible 
interaction energy (E tota i) of i a with the rest of the protein; that is, "min over t" means find the rotamer t 
1 5 on position j that has the best interaction with rotamer i a . Similarly, the right side of the inequality is the 
worst possible (max) interaction energy of rotamer i b with the rest of the protein. If this inequality is 
true, then rotamer i a is Dead-Ending and can be Eliminated. The speed of DEE comes from the fact 
that the theorem only requires sums over the sequence length to test and eliminate rotamers. 

20 In a preferred embodiment, a variation of DEE is performed. Goldstein DEE, based on Goldstein, 

(1994) (supra), hereby expressly incorporated by reference, is a variation of the DEE computation, as 
shown in Equation 24: 

Equation 24 

25 E(i a ) - E(i b ) + £[min over t{E(i a , j t ) - E(i b , j t )}] > 0 

In essence, the Goldstein Equation 24 says that a first rotamer a of a particular position 
i (rotamer i a ) will not contribute to a local energy minimum if the energy of conformation with i a can 
always be lowered by just changing the rotamer at that position to i b , keeping the other residues equal 
30 (for notational convenience i a , and i b are the equivalents of i r , /' f , respectively, in Equation 36). If this 
inequality is true, then rotamer i a is Dead-Ending and can be Eliminated. 

Thus, in a preferred embodiment, a first DEE computation is done where rotamers at a single variable 
position are compared, ("singles" DEE) to eliminate rotamers at a single position. This analysis is 

35 repeated for every variable position, to eliminate as many single rotamers as possible. In addition, 
every time a rotamer is eliminated from consideration through DEE, the minimum and maximum 
calculations of Equation 23, depending on which DEE variation is used, thus conceivably allowing the 
elimination of further rotamers. Accordingly, the singles DEE computation can be repeated until no 
more rotamers can be eliminated; that is, when the inequality is not longer true such that all of them 

40 could conceivably be found on the global optimum. 
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In a preferred embodiment, "doubles" DEE is additionally done. In doubles DEE, pairs of rotamers are 
evaluated; that is, a first rotamer at a first position and a second rotamer at a second position are 
compared to a third rotamer at the first position and a fourth rotamer at the second position, either 
using original or Goldstein DEE. Pairs are then flagged as nonallowable, although single rotamers 
cannot be eliminated, only the pair. Again, as for singles DEE, every time a rotamer pair is flagged as 
nonallowable, the minimum calculations of Equation 24 change (depending on which DEE variation is 
used) thus conceivably allowing the flagging of further rotamer pairs. Accordingly, the doubles DEE 
computation can be repeated until no more rotamer pairs can be flagged. 

As is apparent from equations (17) and (24), the performance of any DEE algorithm depends heavily 
on the nature of the physical model used to compute the singles energy (i.e. rotamer-backbone; 
defined above) and the doubles energy (i.e. rotamer-rotamer; defined above) terms E(i a ) and E(i a ,j t ), 
respectively in equation (17). Potential scoring functions that emphasize energy terms that contribute 
to E(/ a ) relative to E(iJ t ) will result in less coupling and easier optimization. In the limit of 
I E(i a ) | » | E(iJ t ) | , the optimization reduces to the selection of the rotamer with the best one-body 
energy at each residue position. 

In a preferred embodiment, optimization with a potential scoring function modified to emphasize 
singles energy terms is used. Scoring functions that can be modified to favor E(/ a ) include, but are not 
limited to: (1) atomic solvation potential; (2) Coulombic potential with a non-distance-dependent 
dielectric constant for rotamer/backbone interactions and a distance-dependent dielectric constant for 
rotamer/rotamer interactions; (3) rotamer internal strain energy; (4) secondary structure propensities 
for helical and beta-strand positions; and, (5) van der Waals energies. 

in addition, in a preferred embodiment, rotamer pairs are initially prescreened to eliminate rotamer 
pairs prior to DEE. This is done by doing relatively computationally inexpensive calculations to 
eliminate certain pairs up front. This may be done in several ways, as is outlined below. 

To search exhaustively for all dead-ending rotamers at a residue position /, it is necessary to compare 
every rotamer to every other rotamer available at /. In a comparison matrix, each column corresponds 
to a particular rotamer, i n as a candidate rfor elimination, and each row corresponds to one of the 
possible reference rotamers If there are n t rotamers at position /, then an exhaustive search of n 2 -n 
matrix elements is necessary. Such a matrix is evaluated for each of the positions that may be 
represented by /. 

In a preferred embodiment, the rotamer pair with the lowest interaction energy with the rest of the 
system is found. Inspection of the energy distributions in sample comparison matrices has revealed 
that an i j v pair that dead-end eliminates a particular i r j s pair can also eliminate other ij s pairs. In fact, 
there are often a few i j v pairs, which we call "magic bullets," that eliminate a significant number of i r j s 
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pairs. We have found that one of the most potent magic bullets is the pair for which maximum 
interaction energy, e max ([ij v ])k t , is least (see Equations 29-31). This pair is referred to as {ij v ) mb . If this 
rotamer pair is used in the first round of doubles DEE, it tends to eliminate pairs faster. 

Our first speed enhancement is to evaluate the first-order doubles calculation for only the matrix 
elements in the row corresponding to the (i j v ) mb pair. The discovery of (ij v ) mb is an n 2 calculation (n = 
the number of rotamers per position), and the application of Equation 24 to the single row of the matrix 
corresponding to this rotamer pair is another n 2 calculation, so the calculation time is small in 
comparison to a full Goldstein calculation, in practice, this calculation produces a large number of 
dead-ending pairs, often enough to proceed to the next iteration of singles elimination without any 
further searching of the doubles matrix. 

The magic bullet Goldstein calculation will also discover all dead-ending pairs that would be 
discovered by the Equation 23 or 24, thereby making it unnecessary. This stems from the fact that 
e ma x((Uv)mb) must be less than or equal to any e max ([ij v ]) that would successfully eliminate a pair by 
Equations 23 or 24. 

Since the minima and maxima of any given pair has been precalculated as outlined herein, a second 
speed-enhancement precalculation may be done. By comparing extrema, pairs that will not dead end 
can be identified and thus skipped, reducing the time of the DEE calculation. Thus, pairs that satisfy 
either one of the following criteria are skipped: 

Equation 25 
e^dy,] Xe^t [i a j v ] ) 



Equation 26: 

Because the matrix containing these calculations is symmetrical, half of its elements will satisfy the 
first inequality Equation 25, and half of those remaining will satisfy the other inequality Equation 26. 
These three quarters of the matrix need not be subjected to the evaluation of Equation 23 or 24, 
resulting in a theoretical speed enhancement of a factor of four. 

The last DEE speed enhancement refines the search of the remaining quarter of the matrix. This is 
done by constructing a metric from the precomputed extrema to detect those matrix elements likely to 
result in a dead-ending pair. 
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A metric was found through analysis of matrices from different sample optimizations. We searched for 
combinations of the extrema that predicted the likelihood that a matrix element would produce a dead- 
ending pair. Interval sizes (see Figure 4) for each pair were computed from differences of the 
extrema. The size of the overlap of the i r j s and i u j v intervals were also computed, as well as the 
5 difference between the minima and the difference between the maxima. Combinations of these 

quantities, as well as the lone extrema, were tested for their ability to predict the occurrence of dead- 
ending pairs. Because some of the maxima were very large, the quantities were also compared 
logarithmically. 

1 0 Most of the combinations were able to predict dead-ending matrix elements to varying degrees. The 
best metrics were the fractional interval overlap with respect to each pair, referred to herein as q re and 

duv 

Equation 27 

_ interval overlap _ e ma x ( ) " g min t ^Js^ > 

1 qrs interval ([i r j J) e max ( [ i r j s ] ) -e min ( [ i r jj ) 

15 

Equation 28 

_ interval overlap _ g max ( ^ i Jv^ > ~ e min( } 
q « v interval ( [ij v ] ) ( [ij v ] ) - &mLn ( [ij v ] ) 

These values are calculated using the minima and maxima equations 29, 30, 31 and 32 (see Figure 
5): 

Equation 29 

e max ( [i r J s ] ) =e( [i r J s ] > + ^ 2 maxe ( [i r j s ] ,k t 

20 

Equation 30 

e min ( [i r j s ] ) =e([i r j s ] ) + s mine( [i r j s ] ,k t ) 

k*i*j t 

Equation 31 
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e ( [i jJ ) =e ( [i j ] ) + S maxe ([ i j v ] , k t ) 

k*±*j t 

Equation 32 

e .( [i j ] ) =e( [i u j v ] ) + S mine ( [i u j v ] , k t ) 

k*i*j t 

These metrics were selected because they yield ratios of the occurrence of dead-ending matrix 
elements to the total occurrence of elements that are higher than any of the other metrics tested. For 
example, there are very few matrix elements (-2%) for which q rs > 0.98, yet these elements produce 
5 30-40% of all of the dead-ending pairs. 

Accordingly, the first-order doubles criterion is applied only to those doubles for which q rs > 0.98 and 
q uv > 0.99. The sample data analyses predict that by using these two metrics, as many as half of the 
O dead-ending elements may be found by evaluating only two to five percent of the reduced matrix. 
10. 

In a preferred embodiment for practicing automated protein design, simple spit DEE (s = 1; Equation 
CI 38) is used to find the GMEC. 

In a preferred embodiment for practicing automated protein design, split singles DEE (s > 1; Equation 
18 39) is used to find the GMEC. 

f|| In a preferred embodiment for practicing automated protein design, split singles DEE flags with split 
ff' flags (s=1 ) is used to find the GMEC. 

20 In a preferred embodiment, split singles DEE with split flags (s>2) with or without magic bullet metric 
(equation 40) is used to find the GMEC. 

In a preferred embodiment for practicing automated protein design, magic bullet metric (Equation 40) 
is used to find the GMEC. 

25 

Generally, as is more fully described below, singles and doubles DEE, using either or both of original 
DEE and Goldstein DEE, simple split DEE (s = 1), split singles DEE (s > 1), and magic bullet metric, is 
run iteratively until no further elimination is possible. Usually, convergence is not complete, and further 
elimination must occur to achieve convergence. 

30 

In a preferred embodiment, additional DEE computation is done by the creation of "super residues" or 
"unification", as is generally described in Desmet , Nature 356:539-542 (1992); Desmet, et a/., The 
Protein Folding Problem and Tertiary Structure Prediction , Ch. 10:1-49 (1994); Goldstein, era/., supra. 
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A super residue is a combination of two or more variable residue positions which is then treated as a 
single residue position. The super residue is then evaluated in singles DEE, and doubles DEE, with 
either other residue positions or super residues. The disadvantage of super residues is that there are 
many more rotameric states which must be evaluated; that is, if a first variable residue position has 5 
5 possible rotamers, and a second variable residue position has 4 possible rotamers, there are 20 

possible super residue rotamers which must be evaluated. However, these super residues may be 
eliminated similar to singles, rather than being flagged like pairs. 

The selection of which positions to combine into super residues may be done in a variety of ways. In 
10 general, random selection of positions for super residues results in inefficient elimination, but it can be 
done, although this is not preferred. In a preferred embodiment, the first evaluation is the selection of 
positions for a super residue is the number of rotamers at the position. If the position has too many 
rotamers, it is never unified into a super residue, as the computation becomes too unwieldy. Thus, 
only positions with fewer than about 100,000 rotamers are chosen, with less than about 50,000 being 
15 preferred and less than about 10,000 being especially preferred. 

|| In a preferred embodiment, the evaluation of whether to form a super residue is done as follows. All 

£ possible rotamer pairs are ranked using Equation 33, and the rotamer pair with the highest number is 

chosen for unification: 

y-20 

" j 3 Equation 33 

l A fraction of flagged pairs 

log(number of super rotamers resulting from the potential unification) 

.25 Equation 33 is looking for the pair of positions that has the highest fraction or percentage of flagged 
pairs but the fewest number of super rotamers. That is, the pair that gives the highest value for 

r "" Equation 33 is preferably chosen. Thus, if the pair of positions that has the highest number of flagged 

pairs but also a very large number of super rotamers (that is, the number of rotamers at position i 
times the number of rotamers at position j), this pair may not be chosen (although it could) over a • 
30 lower percentage of flagged pairs but fewer super rotamers. 

In an alternate preferred embodiment, positions are chosen for super residues that have the highest 
average energy; that is, for positions i and j, the average energy of all rotamers for i and all rotamers 
for j is calculated, and the pair with the highest average energy is chosen as a super residue. 

35 

Super residues are made one at a time, preferably. After a super residue is chosen, the singles and 
doubles DEE computations are repeated where the super residue is treated as if it were a regular 
residue. As for singles and doubles DEE, the elimination of rotamers in the super residue DEE will 
alter the minimum energy calculations of DEE. Thus, repeating singles and/or doubles DEE can result 
40 in further elimination of rotamers. 
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Figure 3A is a detailed illustration of the processing operations associated with a ranking module 34 of 
the invention. The calculation and storage of the singles and doubles energies is the first step, 
although these may be recalculated every time. The optional application of a cutoff, where singles or 
doubles energies that are too high are eliminated prior to further processing also may be included. 
5 Either or both of original singles DEE or Goldstein singles DEE may be done, with the elimination of 
original singles DEE being generally preferred. Once the singles DEE is run, original doubles and/or 
Goldstein doubles DEE is run. Super residue DEE is then generally run, either original or Goldstein 
super residue DEE. This preferably results in convergence at a global optimum sequence. As is 
depicted in Figure 3A, after any step any or all of the previous steps can be rerun, in any order. 

10 

The addition of super residue DEE to the computational processing, with repetition of the previous 
DEE steps, generally results in convergence at the global optimum. Convergence to the global 
optimum is guaranteed if no cutoff applications are made, although generally a global optimum is 
achieved even with these steps. In a preferred embodiment, DEE is run until the global optimum 
1 5 sequence is found. That is, the set of optimized protein sequences contains a single member, the 
global optimum. 

I In a preferred embodiment, the various DEE steps are run until a manageable number of sequences is 
found, i.e. no further processing is required. These sequences represent a set of optimized protein 
20 sequences, and they can be evaluated as is more fully described below. Generally, for computational 
* purposes, a manageable number of sequences depends on the length of the sequence, but generally 
ranges from about 1 to about 10 15 possible rotamer sequences. This range can be extended to 
approximately 10 30 if B&T is used as the next analyzing step. 

25 Alternatively, DEE is run to a point, resulting in a set of optimized sequences (in this context, a set of 
remainder sequences) and then further computational processing of a different type may be run. For 
example, in one embodiment, direct calculation of sequence energy as outlined above is done on the 
remainder possible sequences. Alternatively, a Monte Carlo search can be run. In another 
embodiment, B&T can be run. 

30 

In a preferred embodiment, the computation processing need not comprise a DEE computational step. 
In this embodiment, a Monte Carlo search is undertaken, as is known in the art. See Metropolis ef a/., 
J. Chem. Phys. 21:1087 (1953), hereby incorporated by reference. In this embodiment, a random 
sequence comprising random rotamers is chosen as a start point. In one embodiment, the variable 
35 residue positions are classified as core, boundary or surface residues and the set of available residues 
at each position is thus defined. Then a random sequence is generated, and a random rotamer for 
each amino acid is chosen. This serves as the starting sequence of the Monte Carlo search. A Monte 
Carlo search then makes a random jump at one position, either to a different rotamer of the same 
amino acid or a rotamer of a different amino acid, and then a new sequence energy (E tota , sequence ) is 



-44- 



calculated, and if the new sequence energy meets the Boltzmann criteria for acceptance, it is used as 
the starting point for another jump. If the Boltzmann test fails, another random jump is attempted from 
the previous sequence. In this way, sequences with lower and lower energies are found, to generate a 
set of low energy sequences. 

5 

If computational processing results in a single global optimum sequence, it is frequently preferred to 
generate additional sequences in the energy neighborhood of the global solution, which may be 
ranked. These additional sequences are also optimized protein sequences. The generation of 
additional optimized sequences is generally preferred so as to evaluate the differences between the 

10 theoretical and actual energies of a sequence. Generally, in a preferred embodiment, the set of 
sequences is at least about 75% homologous to each other, with at least about 80% homologous 
being preferred, at least about 85% homologous being particularly preferred, and at least about 90% 
being especially preferred. In some cases, homology as high as 95% to 98% is desirable. Homology 
in this context means sequence similarity or identity, with identity being preferred. Identical in this 

15 context means identical amino acids at corresponding positions in the two sequences which are being 
compared. Homology in this context includes amino acids which are identical and those which are 
similar (functionally equivalent). This homology will be determined using standard techniques known 
in the art, such as the Best Fit sequence program described by Devereux, ef a/., Nucl. Acid Res. . 
12:387-395 (1984), or the BLASTX program (Altschul, era/., J. Mol. Biol., 215:403-410 (1990)) 

20 preferably using the default settings for either. The alignment may include the introduction of gaps in 
the sequences to be aligned. In addition, for sequences which contain either more or fewer amino 
acids than an optimum sequence, it is understood that the percentage of homology will be determined 
based on the number of homologous amino acids in relation to the total number of amino acids. Thus, 
for example, homology of sequences shorter than an optimum will be determined using the number of 

25 amino acids in the shorter sequence. 

Once optimized protein sequences are identified, the processing of Figure 2 optionally proceeds to 
step 56 which entails searching the protein sequences. This processing may be implemented with the 
search module 36. The search module 36 is a set of computer code that executes a search strategy. 

30 For example, the search module 36 may be written to execute a Monte Carlo search as described 
above. Starting with the global solution, random positions are changed to other rotamers allowed at 
the particular position, both rotamers from the same amino acid and rotamers from different amino 
acids. A new sequence energy (E, ota | Sequenoe ) is calculated, and if the new sequence energy meets the 
Boltzmann criteria for acceptance, it is used as the starting point for another jump. See Metropolis ef 

35 a/., 1953, supra, hereby incorporated by reference. If the Boltzmann test fails, another random jump is 
attempted from the previous sequence. A list of the sequences and their energies is maintained 
during the search. After a predetermined number of jumps, the best scoring sequences may be 
output as a rank-ordered list. Preferably, at least about 1 0 6 jumps are made, with at least about 1 0 7 
jumps being preferred and at least about 1 0 8 jumps being particularly preferred. Preferably, at least 
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about 100 to 1000 sequences are saved, with at least about 10,000 sequences being preferred and at 
least about 100,000 to 1,000,000 sequences being especially preferred. During the search, the 
temperature is preferably set to 1000 K. 

5 Once the Monte Carlo search is over, all of the saved sequences are quenched by changing the 
temperature to 0 K, and fixing the amino acid identity at each position. Preferably, every possible 
rotamer jump for that particular amino acid at every position is then tried. 

The computational processing results in a set of optimized protein sequences. These optimized 
1 0 protein sequences are generally, but not always, significantly different from the wild-type sequence 
from which the backbone was taken. That is, each optimized protein sequence preferably comprises 
at least about 5-10% variant amino acids from the starting or wild-type sequence, with at least about 
15-20% changes being preferred and at least about 30% changes being particularly preferred. 

15 These sequences can be used in a number of ways. In a preferred embodiment, one, some or all of 
the optimized protein sequences are constructed into designed proteins, as show with step 58 of 
Figure 2. Thereafter, the protein sequences can be tested, as shown with step 60 of the Figure 2. 
~ Generally, this can be done in one of two ways. 

20 In a preferred embodiment, the designed proteins are chemically synthesized as is known in the art. 

This is particularly useful when the designed proteins are short, preferably less than 150 amino acids 
• in length, with less than 100 amino acids being preferred, and less than 50 amino acids being 

particularly preferred, although as is known in the art, longer proteins can be made chemically or 
enzymatically. 

25 

In a preferred embodiment, particularly for longer proteins or proteins for which large samples are 
desired, the optimized sequence is used to create a nucleic acid such as DNA which encodes the 
optimized sequence and which can then be cloned into a host cell and expressed. Thus, nucleic 
acids, and particularly DNA, can be made which encodes each optimized protein sequence. This is 
30 done using well known procedures. The choice of codons, suitable expression vectors and suitable 
host cells will vary depending on a number of factors, and can be easily optimized as needed. 

Once made, the designed proteins are experimentally evaluated and tested for structure, function and 
stability, as required. This will be done as is known in the art, and will depend in part on the original 
35 protein from which the protein backbone structure was taken. Preferably, the designed proteins are 
more stable than the known protein that was used as the starting point, although in some cases, if 
some constraints are placed on the methods, the designed protein may be less stable. Thus, for 
example, it is possible to fix certain residues for altered biological activity and find the most stable 
sequence, but it may still be less stable than the wild type protein. Stable in this context means that 



-46- 



the new protein retains either biological activity or conformation past the point at which the parent 
molecule did. Stability includes, but is not limited to, thermal stability, i.e. an increase in the 
temperature at which reversible or irreversible denaturing starts to occur; proteolytic stability, i.e. a 
decrease in the amount of protein which is irreversibly cleaved in the presence of a particular protease 
5 {including autolysis); stability to alterations in pH or oxidative conditions; chelator stability; stability to 
metal ions; stability to solvents such as organic solvents, surfactants, formulation chemicals; etc. 

In a preferred embodiment, the modeled proteins are at least about 5% more stable than the original 
protein, with at least about 10% being preferred and at least about 20-50% being especially preferred. 

10 

The results of the testing operations may be computationally assessed, as shown with step 62 of 
Figure 2. An assessment module 38 may be used in this operation. That is, computer code may be 
prepared to analyze the test data with respect to any number of matrixes. 

15 At this processing juncture, if the protein is selected (the yes branch at block 64) then the protein is 
utilized (step 66), as discussed below. If a protein is not selected, the accumulated information may 
be used to alter the ranking module 34, and/or step 56 is repeated and more sequences are searched. 

In a preferred embodiment, the experimental results are used for design feedback and design 
20 optimization. 

Once made, the proteins of the invention find use in a wide variety of applications, as will be 
appreciated by those in the art, ranging from industrial to pharmocological uses, depending on the 
protein. Thus, for example, proteins and enzymes exhibiting increased thermal stability may be used 

25 in industrial processes that are frequently run at elevated temperatures, for example carbohydrate 
processing (including saccharification and liquifaction of starch to produce high fructose corn syrup 
and other sweetners), protein processing (for example the use of proteases in laundry detergents, 
food processing, feed stock processing, baking, etc.), etc. Similarly, the methods of the present 
invention allow the generation of useful pharmaceutical proteins, such as analogs of known 

30 proteinaceous drugs which are more thermostable, less proteolytically sensitive, or contain other 
desirable changes. 

The following examples serve to more fully describe the manner of using the above-described 
invention, as well as to set forth the best modes contemplated for carrying out various aspects of the 
35 invention. It is understood that these examples in no way serve to limit the true scope of this invention, 
but rather are presented for illustrative purposes. All references cited herein are explicitly incorporated 
by reference in their entirety. 
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EXAMPLES 



Example 1 
Benchmark Design Cases 

5 

To demonstrate the increased power of split DEE over standard Goldstein DEE, the present study 
presents results for five challenging design problems described in Table I. Case 1 represent the 
design of all 25 non-glycine residues (5, 14, 21, 27, 29, 31, 35, 37, 38, 39, 41, 46, 50, 55, 56, 63, 70, 

72, 74, 80, 82, 84, 92, 96, 98) in the core of plastocyanin (PDB code 2pcy). Case 2 involves all 34 
10 core positions on a novel repeating backbone based on the leucine rich repeat motif; the 17 residues 

in each of two repeats have linked, but unspecified amino acid identiities. Case 3 represents the full 
core design of the variable domains of the light and heavy chains of catalytic antibody 48G7 FAB (PDB 
code 1gaf). This corresponds to residues (2,4, 6, 19, 21, 25, 29, 33, 36-38, 44, 46-48, 55, 58, 62, 71, 

73, 75, 78, 82, 84-87, 89, 90, 95-98, 102, 104) of chain L and residues (4,6, 18, 20, 24, 32, 34-39, 45, 
[l5 47, 48, 50, 51, 53, 61, 64, 68, 70, 72, 77, 79, 81, 83, 86, 90, 92-95, 97, 98, 103, 104, 108, 110) of 

chain H. Case 4 involves the design of all 10 non-glycine core residues (3,, 5, 7, 20, 26, 30, 34, 39, 
52, 54) and all 15 boundary residues (1, 11, 12, 16, 18, 23, 25, 27, 29, 33, 37, 43, 45, 50, 56) of the (31 
domain of Protein G (PDB code 1 pga). Case 5 represent the design of all 27 non-glycine surfaces 
residues (2, 4, 6, 8, 10, 13, 15, 17, 19, 21, 22, 24, 28, 31, 32, 35, 36, 40, 42, 44, 46, 47, 48, 49, 51, 53, 
20 55) of the (31 domain of Protein G. The benchmark calculations were performed on 16 Power3 

processors of an IBM SP3 running at 375 MHz. The CPU minutes for the benchmark design cases is 
shown in Table 2. 



Table 1. Benchmark Design Cases 



Case 


Description 


Type 


Residues 


Rotamers 


Conformation 


1 


Plastocyanin 


Core 


25 


1716 


^ 7x1q38 


2 


Novel backbone 


Linked Core 


34 


674 


8.4 x10 39 


3 


Catalytic antibody 


CORE 


74 


4919 


4.7 x10 128 


4 


31 of Protein G 


Core/boundary 


25 


4295 


4.0 x10 53 


5 


01 of Protein G 


Surface 


27 


4842 


4.9 x 10 60 



Table 2. CPU minutes for the benchmark design cases. 

10 



Case 


Method 


Time (minutes) 


Remaining 








Conformations 




DEE s2mb 


334 


7X1 0 14 




DEE s2 


150 


2X1 0 11 


1 


DEE s2 bound flags 


22 


1 




DEE s2 split flags 


46 


1 




HERO 


13 


1 




DEE s2mb 


250 


1x10 18 




DEE s2 


210 


1x10 18 


2 


DEE s2 bound flags 


23 


1 




DEE s2 split flags 


167 


3x1 0 16 




HERO 


7 


1 




DEE s2mb 


984 


3x10 s 




DEE s2 


687 


1 


3 


DEE s2 bound flags 


663 


1 




DEE s2 split flags 


299 


1 




HERO 


359 


1 




DEE s2mb 


1449 


2x1 0 35 




DEE s2 


1333 


1x10 35 


4 


DEE s2 bound flags 


1688 


1x10 3s 




DEE s2 split flags 


875 


9x1 0 19 




HERO 


476 


1 




DEE s2mb 


292 


3x1 0 16 




DEE s2 


129 


1 


5 


DEE s2 bound flags 


72 


1 




DEE s2 split flags 


46 


1 




HERO 


35 


1 



Example 2 



The HERO Algorithm 



Using a rotameric description of conformational space, the Side Chain Placement Problem of 
Homology Modeling (Desmet, J., et al., (1992) Nature, 365:539) and the Sequence Selection Problem 
5 of Protein Design (Dahiyat, B.I., and Mayo, SL. (1996) Prot. Sci., 5:895) can both be described as the 
following combinatorial optimization problem: 

Choose the single rotamer for each residue position that minimizes the sum of the 
pairwise interaction energies between rotamers at all positions. 

10 

The problem may also be recast as the Flight Pricing Problem: 

For a set of cities each containing multiple airports, choose the airport for each city 
that minimizes the cost of visiting every city from every other city. 

15 

and as The Belief Network Problem (Pearl, J. (1988) "Probabilistic Reasoning in Intelligent Systems", 
Morgan-Kauffman): 

For a graph with nodes representing conditional probabilities and edges representing 
20 dependencies between these probabilities, determine the values at the nodes that 

maximize the sum of these conditional probabilities. 

and the Spin Glass Problem (Mezard, G., et al., (1987), "Spin Glass Theory and Beyond", World 
Scientific): 

25 

For a graph with nodes representing spin states and edges representing coupling 
between spin states at neighboring nodes, find the set of spins that correspond to the 
lowest energy ground state of the system. 

30 The common mathematical structure in all of these problems is that the system is defined by a set of 
candidate solutions at each of a number of nodes. Each candidate solution is described in terms of a 
self-energy (possibly zero) and a set of pairwise interaction energies (possibly zero) with candidate 
solutions at other nodes. The goal is to find a list of unique candidate solutions (one for each node) 
that produces the global optimum of a specified quantity which is based on these self-and pairwise 

35 energies. The HERO algorithm is an exact search algorithm for solving any problem with this 
structure. 
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The HERO Algorithm 

At convergence, this algorithm identifies the single rotamerat each position that belongs to the global 
minimum energy conformation. HERO may be run using simple split singles DEE or split singles DEE 
with split flags. The following cycle is an example of HERO run with split singles DEE. This cycle is 
repeated until the global minimum energy conformation (GMEC) is identified by eliminating all but one 
rotamer at each position: 

1) Iterative simple Goldstein singles DEE (7=1) until no further eliminations; 

2) Iterative simple split singles DEE (s=1) until no further eliminations; 

3) Split single DEE (s > 1 ) with or without magic bullet metric once for each rotamer; 

4) Apply singles bounding criteria to eliminate rotamers whose bounding energy E bomd 
is greater than E low , the lowest known energy of a valid conformation obtained from 
the Monte Carlo searches; 

5) Alternate sequentially between the following, applying one during each cycle: 

a) Magic bullet DEE Goldstein doubles calculation (T=1) to flag dead ending 
pairs; 

b) Monte Carlo search to find a low energy of a valid conformation E low , 

c) Apply doubles bounding criteria to flag pairs whose bounding energy E bound is 
greater than E tolv ; 

d) Goldstein DEE doubles calculation (T=1) to flag dead ending pairs; and, 

e) Unification of any uniquely defined positions, followed by unification of the 
two residues with the highest fraction of dead ending pairs, followed by 
restoration of all flags for the new super-residues; 

6) return to 1 . 

A sample calculation comparing the performance of HERO, run using simple split DEE, to the previous 
state-of-the-art DEE algorithm is shown in Figure 8. The initial number of conformations is 8.4 x 10 39 
and DEE (s =2 mb ) fails to reduce the number of conformations below 1 .0 x 10 20 after more than 6000 
minutes, while HERO converges to the unique minimum conformation in 167 minutes. 

Example 3 

Physical Model Dependence 

The potential function has ben previously described (see Dahiyat, B.I. and Mayo, S.L. (1996) Protein 
Sci., 5:895-903; Dahiyat, B.I. and Mayo, S.L., (1997) Science, 278:82-87) and incorporates terms for 
van der Waals interactions, hydrogen bonds, Coulombic interations and solvation. The backbone- 
dependent rotamer libraries are based on the mean values from the Dunbrack and Karplus library 
preference) with expansion of the Xi and x 2 angles for aromatic residues, the Xi angle for hydrophobic 
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residues and no expansion for polar residues. Canonical values for x 3 and x 4 angles are used for 
amino acids E, Q, K and R. Residues are classified into core, boundary or surface positions (as 
described herein). Core residue identities are selected from among the amino acids A, V, L, I, F, Y 
and W, while surface residue identities are selected form among A, S, T, D, N, H, E, Q, K and R. 
Boundary residue identities are chosen from the union of these sets. 

An example of the dependence of the optimization performance on the underlying physical model is 
shown in Figure 15. This case is a full sequence design of the 56 positions in the (31 domain of 
Protein G. Three of these positions are preset to glycine (position 38 has a positive phi angle and 
functions as a C-cap for the alpha helix; positions 9 and 41 are sterically constrained core positions). 
The remaining positions are divided into core, boundary and surface regions with the allowed amino 
acid identities at each of the 53 positions constrained to preserve the binary pattern of the wild-type 
sequence. The resulting combinatorial complexity is 10 112 conformations with 7775 initial rotamers 
after applying high-energy threshold reduction (HETR; DeMaeyer, M., etal., (1997) Fold. Des., 2:53- 
66) to eliminate rotamers that clash with the backbone. A HERO run with "standard" potential function 
and rotamer library fails to converge after more than 1000 minutes. Optimization with a potential 
function modified to emphasize one-body terms reaches the GMEC in 20 minutes. 



In addition to the potential function component of the physical model, care must be taken with respect 
to the rotamer library. Previous computational work using surprisingly small rotamer libraries 
(approximately 67 rotamers per residue position) showed large, full sequence design problems to be 
tractable (Looger, L.L. and Hellinga, H.W. (2001) J. Mol. Biol., 307:429-445). For the full sequence 
design of Protein G, the average number of rotamers per residue position is 147. Using an 
unexpanded rotamer library and aggressive HETR, the average number of rotamers per position can 
be reduced to 70 (10 80 conformations for 3705 rotamers). Obtaining the GMEC for the resulting 
problem using the standard potential function requires 28 minutes. These results strikingly illustrate 
the dependence of algorithm performance on both the potential function and the rotamer library. 
Clearly, search algorithm performance cannot be meaningfully ascertained on models of uncertain 
biological validity. 
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